Data mining Project - University of Pisa, academic year 2023/24
Authors: Giacomo Aru, Giulia Ghisolfi, Luca Marini, Irene Testa
We import the libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt, mpld3
from matplotlib.legend_handler import HandlerPathCollection
import seaborn as sns
import matplotlib.dates as mdates
import plotly.express as px
import plotly.offline as pyo
import plotly.subplots as sp
from plotly_calplot import calplot
import plotly.graph_objs as go
import math
import os
import calendar
import sys
from sklearn.neighbors import KNeighborsClassifier
from geopy import distance as geopy_distance
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
from pyproj import Transformer
import zipfile
import builtins
sys.path.append(os.path.abspath('..'))
from plot_utils import *
We define constants and settings for the notebook:
%matplotlib inline
DATA_FOLDER_PATH = '../data/'
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
We load the datasets:
incidents_path = DATA_FOLDER_PATH + 'incidents.csv'
elections_path = DATA_FOLDER_PATH + 'year_state_district_house_cleaned.csv'
poverty_path = DATA_FOLDER_PATH + 'poverty_by_state_year_cleaned.csv'
incidents_df = pd.read_csv(incidents_path, low_memory=False)
elections_df = pd.read_csv(elections_path, low_memory=False)
poverty_df = pd.read_csv(poverty_path, low_memory=False)
We assess the correct loading of the dataset by printing two random rows:
incidents_df.sample(n=2, random_state=1)
| | date | state | city_or_county | address | latitude | longitude | congressional_district | state_house_district | state_senate_district | participant_age1 | participant_age_group1 | participant_gender1 | min_age_participants | avg_age_participants | max_age_participants | n_participants_child | n_participants_teen | n_participants_adult | n_males | n_females | n_killed | n_injured | n_arrested | n_unharmed | n_participants | notes | incident_characteristics1 | incident_characteristics2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8141 | 2015-12-10 | Illinois | Chicago | 2700 block of South Lawndale | 41.8423 | -87.7174 | 4.0 | 24.0 | 12.0 | 20.0 | Adult 18+ | Male | 20.0 | 20.0 | 20.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0 | 1 | 0.0 | 0.0 | 1.0 | Little Village - documented gang member shot multiple times, critical | Shot - Wounded/Injured | Gang involvement |
| 64095 | 2017-11-05 | Hawaii | Honolulu | NaN | 21.3061 | -157.8610 | 1.0 | 26.0 | 13.0 | 35.0 | Adult 18+ | Male | 35.0 | 35.0 | 35.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0 | 1 | 0.0 | 0.0 | 1.0 | dropped off at hospital, unclear where he was shot | Shot - Wounded/Injured | NaN |
This dataset contains information about gun incidents in the USA.
In the following table we provide the characteristics of each attribute of the dataset. To define the type of the attributes we used the categorization described by Pang-Ning Tan, Michael Steinbach and Vipin Kumar in the book Introduction to Data Mining. For each attribute, we also report the desired pandas dtype for later analysis.
| # | Name | Type | Description | Desired dtype |
|---|---|---|---|---|
| 0 | date | Numeric (Interval) | Date of incident occurrence | datetime |
| 1 | state | Categorical (Nominal) | State where the incident took place | object |
| 2 | city_or_county | Categorical (Nominal) | City or county where the incident took place | object |
| 3 | address | Categorical (Nominal) | Address where the incident took place | object |
| 4 | latitude | Numeric (Interval) | Latitude of the incident | float64 |
| 5 | longitude | Numeric (Interval) | Longitude of the incident | float64 |
| 6 | congressional_district | Categorical (Nominal) | Congressional district where the incident took place | int64 |
| 7 | state_house_district | Categorical (Nominal) | State house district | int64 |
| 8 | state_senate_district | Categorical (Nominal) | State senate district where the incident took place | int64 |
| 9 | participant_age1 | Numeric (Ratio) | Exact age of one (randomly chosen) participant in the incident | int64 |
| 10 | participant_age_group1 | Categorical (Ordinal) | Age group of one (randomly chosen) participant in the incident | object |
| 11 | participant_gender1 | Categorical (Nominal) | Gender of one (randomly chosen) participant in the incident | object |
| 12 | min_age_participants | Numeric (Ratio) | Minimum age of the participants in the incident | int64 |
| 13 | avg_age_participants | Numeric (Ratio) | Average age of the participants in the incident | float64 |
| 14 | max_age_participants | Numeric (Ratio) | Maximum age of the participants in the incident | int64 |
| 15 | n_participants_child | Numeric (Ratio) | Number of child participants (0-11) | int64 |
| 16 | n_participants_teen | Numeric (Ratio) | Number of teen participants (12-17) | int64 |
| 17 | n_participants_adult | Numeric (Ratio) | Number of adult participants (18+) | int64 |
| 18 | n_males | Numeric (Ratio) | Number of male participants | int64 |
| 19 | n_females | Numeric (Ratio) | Number of female participants | int64 |
| 20 | n_killed | Numeric (Ratio) | Number of people killed | int64 |
| 21 | n_injured | Numeric (Ratio) | Number of people injured | int64 |
| 22 | n_arrested | Numeric (Ratio) | Number of arrested participants | int64 |
| 23 | n_unharmed | Numeric (Ratio) | Number of unharmed participants | int64 |
| 24 | n_participants | Numeric (Ratio) | Number of participants in the incident | int64 |
| 25 | notes | Categorical (Nominal) | Additional notes about the incident | object |
| 26 | incident_characteristics1 | Categorical (Nominal) | Incident characteristics | object |
| 27 | incident_characteristics2 | Categorical (Nominal) | Incident characteristics (not all incidents have two available characteristics) | object |
We display a concise summary of the DataFrame:
incidents_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 239677 entries, 0 to 239676
Data columns (total 28 columns):
 #   Column                     Non-Null Count   Dtype
---  ------                     --------------   -----
 0   date                       239677 non-null  object
 1   state                      239677 non-null  object
 2   city_or_county             239677 non-null  object
 3   address                    223180 non-null  object
 4   latitude                   231754 non-null  float64
 5   longitude                  231754 non-null  float64
 6   congressional_district     227733 non-null  float64
 7   state_house_district       200905 non-null  float64
 8   state_senate_district      207342 non-null  float64
 9   participant_age1           147379 non-null  float64
 10  participant_age_group1     197558 non-null  object
 11  participant_gender1        203315 non-null  object
 12  min_age_participants       164879 non-null  object
 13  avg_age_participants       165057 non-null  object
 14  max_age_participants       164969 non-null  object
 15  n_participants_child       197573 non-null  object
 16  n_participants_teen        197578 non-null  object
 17  n_participants_adult       197575 non-null  object
 18  n_males                    203315 non-null  float64
 19  n_females                  203315 non-null  float64
 20  n_killed                   239677 non-null  int64
 21  n_injured                  239677 non-null  int64
 22  n_arrested                 212051 non-null  float64
 23  n_unharmed                 212051 non-null  float64
 24  n_participants             239677 non-null  float64
 25  notes                      158660 non-null  object
 26  incident_characteristics1  239351 non-null  object
 27  incident_characteristics2  141931 non-null  object
dtypes: float64(11), int64(2), object(15)
memory usage: 51.2+ MB
We notice that:
- congressional_district, state_house_district, state_senate_district, participant_age1, n_males, n_females, n_arrested, n_unharmed and n_participants are stored as float64 but should be int64
- min_age_participants, avg_age_participants, max_age_participants, n_participants_child, n_participants_teen and n_participants_adult are stored as object but should be int64; this probably indicates the presence of syntactic errors (values not in the domain)
- date is stored as object but should be datetime
- date, state, city_or_county, n_killed, n_injured and n_participants have no missing values

We display descriptive statistics of the DataFrame so as to better understand how to cast the data:
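The float64 storage of integer counts is expected pandas behaviour: a NumPy-backed integer column cannot hold NaN, so a single missing value upcasts the whole column to float64. A minimal sketch on toy values (not taken from the dataset), also showing the nullable Int64 dtype that avoids the upcast:

```python
import pandas as pd
import numpy as np

# a NumPy-backed integer column cannot hold NaN: one missing value upcasts it to float64
s_float = pd.Series([1, 2, np.nan])

# pandas' nullable Int64 dtype keeps integer storage and represents the gap as <NA>
s_nullable = pd.Series([1, 2, np.nan], dtype='Int64')

print(s_float.dtype, s_nullable.dtype)  # float64 Int64
```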
incidents_df.describe(include='all')
| | date | state | city_or_county | address | latitude | longitude | congressional_district | state_house_district | state_senate_district | participant_age1 | participant_age_group1 | participant_gender1 | min_age_participants | avg_age_participants | max_age_participants | n_participants_child | n_participants_teen | n_participants_adult | n_males | n_females | n_killed | n_injured | n_arrested | n_unharmed | n_participants | notes | incident_characteristics1 | incident_characteristics2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 239677 | 239677 | 239677 | 223180 | 231754.000000 | 231754.000000 | 227733.000000 | 200905.000000 | 207342.00000 | 147379.000000 | 197558 | 203315 | 164879 | 165057 | 164969 | 197573 | 197578 | 197575 | 203315.000000 | 203315.000000 | 239677.000000 | 239677.000000 | 212051.000000 | 212051.000000 | 239677.000000 | 158660 | 239351 | 141931 |
| unique | 2437 | 51 | 12898 | 198037 | NaN | NaN | NaN | NaN | NaN | NaN | 3 | 3 | 12673 | 12869 | 12852 | 25 | 33 | 50 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 136652 | 52 | 90 |
| top | 2017-01-01 | Illinois | Chicago | 2375 International Pkwy | NaN | NaN | NaN | NaN | NaN | NaN | Adult 18+ | Male | 19.0 | 22.0 | 24.0 | 0.0 | 0.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | man shot | Shot - Wounded/Injured | Officer Involved Incident |
| freq | 342 | 17556 | 10814 | 160 | NaN | NaN | NaN | NaN | NaN | NaN | 181324 | 177945 | 7981 | 7606 | 6237 | 193852 | 178223 | 106461 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 501 | 93926 | 13881 |
| mean | NaN | NaN | NaN | NaN | 37.546598 | -89.338348 | 8.001265 | 55.447132 | 20.47711 | 30.295707 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.520252 | 0.212340 | 0.252290 | 0.494007 | 0.468439 | 0.494169 | 1.636895 | NaN | NaN | NaN |
| std | NaN | NaN | NaN | NaN | 5.130763 | 14.359546 | 8.480835 | 42.048117 | 14.20456 | 13.363592 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.996767 | 0.490888 | 0.521779 | 0.729952 | 0.851035 | 0.925566 | 1.252514 | NaN | NaN | NaN |
| min | NaN | NaN | NaN | NaN | 19.111400 | -171.429000 | 0.000000 | 1.000000 | 1.00000 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN | NaN |
| 25% | NaN | NaN | NaN | NaN | 33.903400 | -94.158725 | 2.000000 | 21.000000 | 9.00000 | 21.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | NaN | NaN | NaN |
| 50% | NaN | NaN | NaN | NaN | 38.570600 | -86.249600 | 5.000000 | 47.000000 | 19.00000 | 27.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | NaN | NaN | NaN |
| 75% | NaN | NaN | NaN | NaN | 41.437375 | -80.048625 | 10.000000 | 84.000000 | 30.00000 | 37.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | NaN | NaN | NaN |
| max | NaN | NaN | NaN | NaN | 71.336800 | 97.433100 | 53.000000 | 901.000000 | 94.00000 | 311.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 61.000000 | 23.000000 | 50.000000 | 53.000000 | 63.000000 | 20.000000 | 103.000000 | NaN | NaN | NaN |
We cast the attributes to the correct type:
# NUMERIC ATTRIBUTES
numerical_features = [
'participant_age1',
'n_males',
'n_females',
'n_killed',
'n_injured',
'n_arrested',
'n_unharmed',
'n_participants',
'min_age_participants',
'avg_age_participants',
'max_age_participants',
'n_participants_child',
'n_participants_teen',
'n_participants_adult',
# (the following attributes should be categorical, but for convenience we keep them numeric)
'congressional_district',
'state_house_district',
'state_senate_district'
]
incidents_df[numerical_features] = incidents_df[numerical_features].apply(pd.to_numeric, errors='coerce')
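With `errors='coerce'`, values that cannot be parsed as numbers become NaN instead of raising an exception, which is how the syntactic errors in the object-typed columns are silently turned into missing values. A quick sketch on toy values (not from the dataset):

```python
import pandas as pd

# unparseable entries are coerced to NaN instead of raising an error
raw = pd.Series(['19.0', '22', 'NA', 'twenty'])
clean = pd.to_numeric(raw, errors='coerce')
print(clean.tolist())  # [19.0, 22.0, nan, nan]
```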
# DATE
incidents_df['date'] = pd.to_datetime(incidents_df['date'], format='%Y-%m-%d')
# CATEGORICAL ATTRIBUTES
# nominal
incidents_df['participant_gender1'] = incidents_df['participant_gender1'].astype("category")
# ordinal
incidents_df['participant_age_group1'] = incidents_df['participant_age_group1'].astype(
pd.api.types.CategoricalDtype(categories = ["Child 0-11", "Teen 12-17", "Adult 18+"], ordered = True))
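Declaring participant_age_group1 as an ordered categorical makes order-aware comparisons and sorting possible. A short sketch with the same categories and toy values:

```python
import pandas as pd

age_group_dtype = pd.api.types.CategoricalDtype(
    categories=["Child 0-11", "Teen 12-17", "Adult 18+"], ordered=True)
groups = pd.Series(["Adult 18+", "Child 0-11", "Teen 12-17"], dtype=age_group_dtype)

# the declared order makes comparisons meaningful
print((groups < "Adult 18+").tolist())  # [False, True, True]
```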
We display again information about the dataset to check the correctness of the casting and the number of missing values:
incidents_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 239677 entries, 0 to 239676
Data columns (total 28 columns):
 #   Column                     Non-Null Count   Dtype
---  ------                     --------------   -----
 0   date                       239677 non-null  datetime64[ns]
 1   state                      239677 non-null  object
 2   city_or_county             239677 non-null  object
 3   address                    223180 non-null  object
 4   latitude                   231754 non-null  float64
 5   longitude                  231754 non-null  float64
 6   congressional_district     227733 non-null  float64
 7   state_house_district       200905 non-null  float64
 8   state_senate_district      207342 non-null  float64
 9   participant_age1           147379 non-null  float64
 10  participant_age_group1     197558 non-null  category
 11  participant_gender1        203315 non-null  category
 12  min_age_participants       159126 non-null  float64
 13  avg_age_participants       159168 non-null  float64
 14  max_age_participants       159084 non-null  float64
 15  n_participants_child       197568 non-null  float64
 16  n_participants_teen        197571 non-null  float64
 17  n_participants_adult       197572 non-null  float64
 18  n_males                    203315 non-null  float64
 19  n_females                  203315 non-null  float64
 20  n_killed                   239677 non-null  int64
 21  n_injured                  239677 non-null  int64
 22  n_arrested                 212051 non-null  float64
 23  n_unharmed                 212051 non-null  float64
 24  n_participants             239677 non-null  float64
 25  notes                      158660 non-null  object
 26  incident_characteristics1  239351 non-null  object
 27  incident_characteristics2  141931 non-null  object
dtypes: category(2), datetime64[ns](1), float64(17), int64(2), object(6)
memory usage: 48.0+ MB
We drop duplicates:
print(f"# of rows before dropping duplicates: {incidents_df.shape[0]}")
incidents_df.drop_duplicates(inplace=True)
print(f"# of rows after dropping duplicates: {incidents_df.shape[0]}")
# of rows before dropping duplicates: 239677
# of rows after dropping duplicates: 239381
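Note that drop_duplicates treats rows with NaN in the same positions as duplicates of each other, which is the behaviour we want here. A toy sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 1, 2], 'b': [np.nan, np.nan, 3.0]})
deduped = df.drop_duplicates()
print(len(deduped))  # 2 -- the two (1, NaN) rows count as duplicates of each other
```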
Now we visualize missing values:
fig, ax = plt.subplots(figsize=(12,8))
sns.heatmap(incidents_df.isnull(), cbar=False, xticklabels=True, ax=ax)
ax.set_title('Null values Heatmap')
ax.set_ylabel('row index')
We observe that:
- latitude and longitude are often missing together, and the same holds for the pairs n_males and n_females, and n_arrested and n_unharmed, as well as for the triple n_participants_child, n_participants_teen and n_participants_adult
- min_age_participants, avg_age_participants and max_age_participants are often missing together, but not always
- participant_age1, notes and incident_characteristics2 have many missing values
- in some rows participant_age1 is missing while participant_age_group1 is not, and the same holds for state_house_district w.r.t. state_senate_district
- latitude and longitude are often available and could be used to recover the missing values of address, congressional_district, state_house_district and state_senate_district (using external libraries that offer this service)

We display descriptive statistics:
incidents_df.describe(include='all')
| | date | state | city_or_county | address | latitude | longitude | congressional_district | state_house_district | state_senate_district | participant_age1 | participant_age_group1 | participant_gender1 | min_age_participants | avg_age_participants | max_age_participants | n_participants_child | n_participants_teen | n_participants_adult | n_males | n_females | n_killed | n_injured | n_arrested | n_unharmed | n_participants | notes | incident_characteristics1 | incident_characteristics2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 239381 | 239381 | 239381 | 222934 | 231458.000000 | 231458.000000 | 227438.000000 | 200648.000000 | 207075.000000 | 147360.000000 | 197508 | 203266 | 1.591070e+05 | 1.591490e+05 | 1.590650e+05 | 197518.000000 | 197521.000000 | 197522.000000 | 203266.000000 | 203266.000000 | 239381.000000 | 239381.000000 | 211988.000000 | 211988.000000 | 239381.000000 | 158558 | 239055 | 141775 |
| unique | 2437 | 51 | 12898 | 198037 | NaN | NaN | NaN | NaN | NaN | NaN | 3 | 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 136652 | 52 | 90 |
| top | 2017-01-01 00:00:00 | Illinois | Chicago | 2375 International Pkwy | NaN | NaN | NaN | NaN | NaN | NaN | Adult 18+ | Male | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | man shot | Shot - Wounded/Injured | Officer Involved Incident |
| freq | 342 | 17555 | 10813 | 148 | NaN | NaN | NaN | NaN | NaN | NaN | 181274 | 177899 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 501 | 93910 | 13879 |
| first | 2013-01-01 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| last | 2030-11-28 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | NaN | NaN | NaN | 37.548820 | -89.335650 | 7.997366 | 55.439910 | 20.472819 | 30.295392 | NaN | NaN | 5.675037e+06 | 2.446473e+04 | 1.871629e+04 | 16.126085 | 8.464052 | 18.520398 | 1.520338 | 0.212377 | 0.252589 | 0.494538 | 0.468456 | 0.494179 | 1.638589 | NaN | NaN | NaN |
| std | NaN | NaN | NaN | NaN | 5.130140 | 14.359154 | 8.477303 | 42.050721 | 14.206066 | 13.363816 | NaN | NaN | 2.256305e+09 | 2.199620e+06 | 1.071589e+05 | 3295.134752 | 2224.207250 | 3233.125322 | 0.996845 | 0.490928 | 0.522020 | 0.730173 | 0.851113 | 0.925628 | 1.252204 | NaN | NaN | NaN |
| min | NaN | NaN | NaN | NaN | 19.111400 | -171.429000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | NaN | NaN | -1.000000e+03 | -1.000000e+03 | -1.000000e+03 | -977.000000 | -947.000000 | -991.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN | NaN |
| 25% | NaN | NaN | NaN | NaN | 33.905725 | -94.144100 | 2.000000 | 21.000000 | 9.000000 | 21.000000 | NaN | NaN | 1.900000e+01 | 2.100000e+01 | 2.100000e+01 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | NaN | NaN | NaN |
| 50% | NaN | NaN | NaN | NaN | 38.572950 | -86.248800 | 5.000000 | 47.000000 | 19.000000 | 27.000000 | NaN | NaN | 2.500000e+01 | 2.700000e+01 | 2.800000e+01 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | NaN | NaN | NaN |
| 75% | NaN | NaN | NaN | NaN | 41.440950 | -80.048000 | 10.000000 | 84.000000 | 30.000000 | 37.000000 | NaN | NaN | 3.500000e+01 | 3.700000e+01 | 4.000000e+01 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | NaN | NaN | NaN |
| max | NaN | NaN | NaN | NaN | 71.336800 | 97.433100 | 53.000000 | 901.000000 | 94.000000 | 311.000000 | NaN | NaN | 9.000000e+11 | 8.719163e+08 | 8.300000e+06 | 886365.000000 | 762487.000000 | 827900.000000 | 61.000000 | 23.000000 | 50.000000 | 53.000000 | 63.000000 | 20.000000 | 103.000000 | NaN | NaN | NaN |
We can already make some considerations about the dataset:
- the most frequent value for state is Illinois and the most frequent value for city_or_county is Chicago (which is in Illinois, so the two are consistent)
- there are 52 unique values for incident_characteristics1 and the most frequent is "Shot - Wounded/Injured" (at the time of data collection, the values this attribute could take on were likely limited to a predefined set)
- there are 90 unique values for incident_characteristics2 and the most frequent is "Officer Involved Incident"; this attribute has more missing values than incident_characteristics1 (again, its values were likely limited to a predefined set)
- the maximum value of date is 2030-11-28, which is in the future
- the maximum value of participant_age1, min_age_participants, avg_age_participants, max_age_participants, n_participants_child, n_participants_teen and n_participants_adult is outside the domain of the attribute (e.g. the maximum value of participant_age1 is 311)

In the following sections of this notebook we will analyze each attribute in detail.
To avoid re-running some cells, we save checkpoints of the dataframe at different stages of the analysis and load the dataframe from the last checkpoint using the following functions:
LOAD_DATA_FROM_CHECKPOINT = True
CHECKPOINT_FOLDER_PATH = 'checkpoints/'
def save_checkpoint(df, checkpoint_name):
    df.to_csv(CHECKPOINT_FOLDER_PATH + checkpoint_name + '.csv')

def load_checkpoint(checkpoint_name, casting=None, date_cols=None):
    df = pd.read_csv(
        CHECKPOINT_FOLDER_PATH + checkpoint_name + '.csv',
        low_memory=False,
        index_col=0,
        parse_dates=date_cols,
        dtype=casting
    )
    return df
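The casting and date_cols parameters of load_checkpoint exist because a CSV round-trip loses dtype information: dates, categoricals and nullable integers all come back as plain object or float columns. A round-trip sketch using a temporary directory and toy data:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2013-01-01', '2014-05-02']),
                   'n_killed': [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'checkpoint.csv')
    df.to_csv(path)
    naive = pd.read_csv(path, index_col=0)                       # dates come back as strings
    restored = pd.read_csv(path, index_col=0, parse_dates=['date'])

print(naive['date'].dtype)     # object
print(restored['date'].dtype)  # datetime64[ns]
```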
We plot the distribution of the dates using different binning strategies:
def plot_dates(df_column, title=None, color=None):
    def iqr_fence(x):
        q1 = x.quantile(0.25)
        med = x.quantile(0.5)
        q3 = x.quantile(0.75)
        IQR = q3 - q1
        u = x.max()
        l = x.min()
        Lower_Fence = builtins.max(builtins.min(x[x > q1 - (1.5 * IQR)], default=pd.Timestamp.min), l)
        Upper_Fence = builtins.min(builtins.max(x[x < q3 + (1.5 * IQR)], default=pd.Timestamp.max), u)
        return [Lower_Fence, q1, med, q3, Upper_Fence]

    relevant_positions = iqr_fence(df_column)
    n_items = len(df_column.index)
    date_min = df_column.min()  # avoid shadowing the built-in min/max
    date_max = df_column.max()

    fig, axs = plt.subplots(3, sharex=True, figsize=(14, 6))
    fig.suptitle(title)

    # one bin per month
    n_bin = int((date_max - date_min).days / 30)
    axs[0].hist(df_column, bins=n_bin, density=True, color=color)
    axs[0].set_ylabel("One bin per month")
    axs[0].grid(axis='y')

    # number of bins computed using Sturges' rule
    n_bin = int(1 + math.log2(n_items))
    axs[1].hist(df_column, bins=n_bin, density=True, color=color)
    axs[1].set_ylabel("Sturges' rule binning")
    axs[1].grid(axis='y')

    axs[2].boxplot(x=mdates.date2num(df_column), labels=[''], vert=False)
    axs[2].set_xlabel('date')

    # mark the fences (dashed) and the quartiles and median (dash-dotted) on the histograms
    for i in range(2):
        axs[i].axvline(x=relevant_positions[0], color='black', linestyle='--', alpha=0.75)
        axs[i].axvline(x=relevant_positions[1], color='black', linestyle='-.', alpha=0.75)
        axs[i].axvline(x=relevant_positions[2], color='black', linestyle='-.', alpha=0.75)
        axs[i].axvline(x=relevant_positions[3], color='black', linestyle='-.', alpha=0.75)
        axs[i].axvline(x=relevant_positions[4], color='black', linestyle='--', alpha=0.75)
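Sturges' rule sets the number of bins to 1 + log2(n). For the roughly 239k rows of this dataset that yields 18 bins, much coarser than the monthly binning. A check of the arithmetic:

```python
import math

n_items = 239381  # number of incidents after dropping duplicates
n_bin = int(1 + math.log2(n_items))
print(n_bin)  # 18
```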
plot_dates(incidents_df['date'], title='Dates distribution')
print('Range data: ', incidents_df['date'].min(), ' - ', incidents_df['date'].max())
print('Unique years: ', sorted(incidents_df['date'].dt.year.unique()))
num_oor = incidents_df[incidents_df['date'].dt.year>2018].shape[0]
print(f'Number of rows with out of range value for the attribute date: {num_oor} ({num_oor/incidents_df.shape[0]*100:.2f}%)')
Range data: 2013-01-01 00:00:00 - 2030-11-28 00:00:00
Unique years: [2013, 2014, 2015, 2016, 2017, 2018, 2028, 2029, 2030]
Number of rows with out of range value for the attribute date: 23008 (9.61%)
These plots show that the number of incidents with an out-of-range value for the date attribute is non-negligible (9.61%) and that, excluding these points, no incidents occurred after the year 2018. Instead of discarding the rows with out-of-range dates, we will try to correct the errors to prevent excessive data loss. Since no other feature suggests the timeframe of the incident, we can only proceed using one of the following approaches:
- recover the correct date from a duplicate of the row, if one exists with an in-range date
- shift the out-of-range dates back by a fixed number of years (10 or 11, which would place them within the observed range)
- replace the out-of-range dates with the mean or the median of the in-range dates
Let's check if there are duplicates with a correct date:
incidents_future = incidents_df[incidents_df['date'].dt.year>2018].drop(columns=['date'])
incidents_past = incidents_df[incidents_df['date'].dt.year<2019].drop(columns=['date'])
# an inner merge on all columns finds rows that occur in both subsets
# (DataFrame.isin would align on the index and never match here)
incidents_past.merge(incidents_future, how='inner').shape[0] != 0
False
Since there are no duplicates, we proceed with the second and third approach:
incidents_df['year'] = incidents_df['date'].dt.year
mean_date = incidents_df[incidents_df['year']<2019]['date'].mean()
median_date = incidents_df[incidents_df['year']<2019]['date'].median()
incidents_df['date_minus10'] = incidents_df['date'].apply(lambda x : x - pd.DateOffset(years=10) if x.year>2018 else x)
incidents_df['date_minus11'] = incidents_df['date'].apply(lambda x : x - pd.DateOffset(years=11) if x.year>2018 else x)
incidents_df['date_mean'] = incidents_df['date'].apply(lambda x : mean_date if x.year>2018 else x)
incidents_df['date_mean'] = incidents_df['date_mean'].dt.normalize() # discard hours, minutes and seconds
incidents_df['date_median'] = incidents_df['date'].apply(lambda x : median_date if x.year>2018 else x)
plot_dates(incidents_df['date_minus10'], 'Dates distribution (year - 10 for oor)')
plot_dates(incidents_df['date_minus11'], 'Dates distribution (year - 11 for oor)', color='orange')
plot_dates(incidents_df['date_mean'], 'Dates distribution (oor replaced with mean)', color='green')
plot_dates(incidents_df['date_median'], 'Dates distribution (oor replaced with median)', color='red')
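pd.DateOffset performs a calendar-aware shift rather than subtracting a fixed number of days: subtracting ten years from February 29 of a leap year rolls back to February 28 when the target year is not a leap year. A quick sketch:

```python
import pandas as pd

# calendar-aware year shift: Feb 29 clamps to the last valid day of the target month
print(pd.Timestamp('2028-02-29') - pd.DateOffset(years=10))  # 2018-02-28 00:00:00
print(pd.Timestamp('2029-06-15') - pd.DateOffset(years=11))  # 2018-06-15 00:00:00
```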
Unfortunately, these methods lead to unsatisfactory results, as they significantly alter the distribution. Therefore, we decided to split the date attribute into year, month and day, setting date to NaT whenever the year is greater than 2018. We also recover the day of the week on which the incident occurred.
incidents_df.drop(columns=['date_minus10', 'date_minus11', 'date_mean', 'date_median'], inplace=True)
incidents_df['date_original'] = incidents_df['date']
incidents_df['date'] = incidents_df['date'].apply(lambda x : pd.NaT if x.year>2018 else x)
incidents_df['year'] = incidents_df['date'].dt.year.astype('UInt64')
incidents_df['month'] = incidents_df['date'].dt.month.astype('UInt64')
incidents_df['month_name'] = incidents_df['date'].dt.month_name()
incidents_df['day'] = incidents_df['date'].dt.day.astype('UInt64')
incidents_df['day_of_week'] = incidents_df['date'].dt.dayofweek.astype('UInt64')
incidents_df['day_of_week_name'] = incidents_df['date'].dt.day_name()
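Once the out-of-range dates are NaT, the .dt accessors yield missing values; casting to the nullable UInt64 dtype keeps the extracted fields integer-typed while propagating <NA>. A toy sketch of the same pattern:

```python
import pandas as pd

dates = pd.Series([pd.Timestamp('2013-01-05'), pd.NaT])
years = dates.dt.year.astype('UInt64')  # NaT becomes <NA> instead of forcing float64
print(years.dtype)     # UInt64
print(years.tolist())  # [2013, <NA>]
```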
We visualize the number of incidents per month:
incidents_df.groupby('month').size().plot(
kind='bar',
figsize=(10, 5),
title='Number of incidents per month',
xlabel='Month',
ylabel='Number of incidents'
)
plt.xticks(range(12), calendar.month_name[1:13], rotation=45);
plt.savefig("../html/incidents_per_month.svg")
# display the months with a number of incidents below the 25th percentile
months = incidents_df.groupby('month').size()
months[months<months.quantile(0.25)].index
Index([4, 6, 11], dtype='UInt64', name='month')
# display the months with a number of incidents above the 75th percentile
months[months>months.quantile(0.75)].index
Index([1, 3, 8], dtype='UInt64', name='month')
We visualize the number of incidents per day of the week:
fig = incidents_df.groupby('day_of_week').size().plot(
kind='bar',
figsize=(10, 5),
title='Number of incidents per day of the week',
xlabel='Day of the week',
ylabel='Number of incidents'
)
plt.xticks(range(7), calendar.day_name[0:7], rotation=45);
plt.savefig("../html/incidents_per_week_day.svg")
# display the days of the week with a number of incidents below the 25th percentile
week_days = incidents_df.groupby('day_of_week').size()
week_days[week_days<week_days.quantile(0.25)].index
Index([3, 4], dtype='UInt64', name='day_of_week')
# display the days of the week with a number of incidents above the 75th percentile
week_days[week_days>week_days.quantile(0.75)].index
Index([5, 6], dtype='UInt64', name='day_of_week')
We display the number of incidents per day over the years:
def group_by_day(df, date_col):
    counts_by_day = df[date_col].groupby(
        [df[date_col].dt.year, df[date_col].dt.month, df[date_col].dt.day]
    ).size().rename_axis(['year', 'month', 'day']).to_frame('Number of incidents').reset_index()
    counts_by_day[['year', 'month', 'day']] = counts_by_day[['year', 'month', 'day']].astype('int64')
    # add missing days
    for day in pd.date_range(start='2017-01-01', end='2017-12-31'): # 2017%4!=0, this will exclude 29 February
        for year in counts_by_day['year'].unique():
            row_exists = (
                (counts_by_day['year']==year) &
                (counts_by_day['month']==day.month) &
                (counts_by_day['day']==day.day)
            ).any()
            if not row_exists:
                counts_by_day = pd.concat([
                    counts_by_day,
                    pd.DataFrame({'year': [year], 'month': [day.month], 'day': [day.day], 'Number of incidents': [0]})
                ])
    counts_by_day.sort_values(by=['year', 'month', 'day'], inplace=True)
    counts_by_day['Day'] = counts_by_day.apply(lambda x: f'{x["day"]} {calendar.month_name[x["month"]]}', axis=1)
    return counts_by_day
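The nested loops above fill missing calendar days with zero counts. An equivalent, faster vectorized sketch of the same idea (toy data) uses a full date_range as a reindex target:

```python
import pandas as pd

# per-day counts with gaps (Jan 2 and 3, and Jan 5 have no incidents)
counts = pd.Series({pd.Timestamp('2017-01-01'): 3, pd.Timestamp('2017-01-04'): 1})
full_index = pd.date_range('2017-01-01', '2017-01-05')
filled = counts.reindex(full_index, fill_value=0)
print(filled.tolist())  # [3, 0, 0, 1, 0]
```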
incidents_counts_by_day = group_by_day(
incidents_df[~((incidents_df['date'].dt.day==29) & (incidents_df['date'].dt.month==2))], # exclude 29 february
'date_original'
)
fig = px.line(
incidents_counts_by_day,
x='Day',
y='Number of incidents',
title='Number of incidents per day',
labels={'Day': 'Day of the year', 'Number of incidents': 'Number of incidents'},
facet_col='year',
width=1200,
height=800,
facet_col_wrap=3
)
fig.update_xaxes(tickangle=-90)
fig.show()
pyo.plot(fig, filename='../html/incidents_per_day_line.html', auto_open=False);
We display the number of incidents per day over the years as a heatmap, using a calendar plot:
incidents_per_day = incidents_df.groupby('date').size()
fig = calplot(
    incidents_per_day.reset_index(),
    x="date",
    y=0,
    title="Number of incidents per day"
)
fig.show()
pyo.plot(fig, filename='../html/incidents_per_day_heatmap.html', auto_open=False);
We visualize the frequency of incidents occurring on different holidays. We consider the following holidays (most of them federal holidays):
| Holiday | Date |
|---|---|
| New Year’s Day | January 1 |
| Martin Luther King’s Birthday | 3rd Monday in January |
| Washington’s Birthday | 3rd Monday in February |
| Saint Patrick's Day | March 17 |
| Memorial Day | last Monday in May |
| Juneteenth National Independence Day | June 19 |
| Independence Day | July 4 |
| Labor Day | 1st Monday in September |
| Columbus Day | 2nd Monday in October |
| Veterans’ Day | November 11 |
| Thanksgiving Day | 4th Thursday in November |
| Christmas Day | December 25 |
| Easter | Sunday (based on moon phase) |
| Easter Monday | Day after Easter |
| Black Friday | Day after Thanksgiving |
holiday_dict = {'New Year\'s Day': ['2013-01-01', '2014-01-01', '2015-01-01', '2016-01-01', '2017-01-01', '2018-01-01'],
'Martin Luther King\'s Day': ['2013-01-21', '2014-01-20', '2015-01-19', '2016-01-18', '2017-01-16', '2018-01-15'],
'Washington\'s Birthday': ['2013-02-18', '2014-02-17', '2015-02-16', '2016-02-15', '2017-02-20', '2018-02-19'],
'Saint Patrick\'s Day': ['2013-03-17', '2014-03-17', '2015-03-17', '2016-03-17', '2017-03-17', '2018-03-17'],
'Easter': ['2013-03-31', '2014-04-20', '2015-04-05', '2016-03-27', '2017-04-16', '2018-04-01'],
'Easter Monday': ['2013-04-01', '2014-04-21', '2015-04-06', '2016-03-28', '2017-04-17', '2018-04-02'],
'Memorial Day': ['2013-05-27', '2014-05-26', '2015-05-25', '2016-05-30', '2017-05-29', '2018-05-28'],
'Juneteenth National Independence Day': ['2013-06-19', '2014-06-19', '2015-06-19', '2016-06-19', '2017-06-19', '2018-06-19'],
'Independence Day': ['2013-07-04', '2014-07-04', '2015-07-03', '2016-07-04', '2017-07-04', '2018-07-04'],
'Labor Day': ['2013-09-02', '2014-09-01', '2015-09-07', '2016-09-05', '2017-09-04', '2018-09-03'],
'Columbus Day': ['2013-10-14', '2014-10-13', '2015-10-12', '2016-10-10', '2017-10-09', '2018-10-08'],
'Veterans\' Day': ['2013-11-11', '2014-11-11', '2015-11-11', '2016-11-11', '2017-11-11', '2018-11-11'],
'Thanksgiving Day': ['2013-11-28', '2014-11-27', '2015-11-26', '2016-11-24', '2017-11-23', '2018-11-22'],
'Black Friday': ['2013-11-29', '2014-11-28', '2015-11-27', '2016-11-25', '2017-11-24', '2018-11-23'],
'Christmas Day': ['2013-12-25', '2014-12-25', '2015-12-25', '2016-12-26', '2017-12-25', '2018-12-25']}
We display the number of incidents occurring on each holiday for each year:
dfs = []
for holiday, dates in holiday_dict.items():
    holiday_data = {'holiday': holiday}
    for i, year in enumerate(range(2013, 2019)):
        holiday_data[f'n_incidents_{year}'] = incidents_df[incidents_df['date'].isin([dates[i]])].shape[0]
    holiday_data['n_incidents_total'] = incidents_df[incidents_df['date'].isin(dates)].shape[0]
    dfs.append(pd.DataFrame([holiday_data]))
year_totals = {'holiday': 'Total incidents during the year'}
for year in range(2013, 2019):
    year_totals[f'n_incidents_{year}'] = incidents_df[incidents_df['date'].dt.year == year].shape[0]
year_totals['n_incidents_total'] = incidents_df.shape[0]
dfs.append(pd.DataFrame([year_totals]))
holidays_df = pd.concat(dfs, ignore_index=True)
holidays_df.sort_values(by=['n_incidents_2017'], ascending=False, inplace=True)
holidays_df
| holiday | n_incidents_2013 | n_incidents_2014 | n_incidents_2015 | n_incidents_2016 | n_incidents_2017 | n_incidents_2018 | n_incidents_total | |
|---|---|---|---|---|---|---|---|---|
| 15 | Total incidents during the year | 253 | 37592 | 44614 | 58724 | 61389 | 13801 | 239381 |
| 0 | New Year's Day | 3 | 171 | 185 | 221 | 342 | 242 | 1164 |
| 8 | Independence Day | 2 | 135 | 106 | 224 | 248 | 0 | 715 |
| 4 | Easter | 1 | 97 | 122 | 172 | 229 | 0 | 621 |
| 6 | Memorial Day | 0 | 116 | 150 | 161 | 217 | 0 | 644 |
| 1 | Martin Luther King's Day | 1 | 124 | 90 | 118 | 186 | 155 | 674 |
| 7 | Juneteenth National Independence Day | 0 | 89 | 111 | 185 | 185 | 0 | 570 |
| 5 | Easter Monday | 0 | 97 | 122 | 164 | 180 | 0 | 563 |
| 2 | Washington's Birthday | 0 | 64 | 85 | 132 | 179 | 165 | 625 |
| 9 | Labor Day | 0 | 145 | 144 | 207 | 172 | 0 | 668 |
| 13 | Black Friday | 2 | 77 | 99 | 157 | 170 | 0 | 505 |
| 11 | Veterans' Day | 1 | 112 | 113 | 158 | 164 | 0 | 548 |
| 10 | Columbus Day | 0 | 127 | 114 | 170 | 160 | 0 | 571 |
| 3 | Saint Patrick's Day | 2 | 91 | 116 | 144 | 150 | 178 | 681 |
| 12 | Thanksgiving Day | 1 | 75 | 93 | 149 | 133 | 0 | 451 |
| 14 | Christmas Day | 3 | 68 | 131 | 174 | 119 | 0 | 495 |
We display the number of incidents occurring on each holiday for each year as a percentage:
holidays_df_percents = holidays_df.copy()
total_row = holidays_df_percents[holidays_df_percents['holiday'] == 'Total incidents during the year']
for year in range(2013, 2019):
    col = f'n_incidents_{year}'
    holidays_df_percents[col] = holidays_df_percents[col] / total_row[col].values[0] * 100
holidays_df_percents.drop(holidays_df.index[0], inplace=True)
holidays_df_percents.sort_values(by=['n_incidents_2017'], ascending=False, inplace=True)
holidays_df_percents
| holiday | n_incidents_2013 | n_incidents_2014 | n_incidents_2015 | n_incidents_2016 | n_incidents_2017 | n_incidents_2018 | n_incidents_total | |
|---|---|---|---|---|---|---|---|---|
| 0 | New Year's Day | 1.185771 | 0.454884 | 0.414668 | 0.376337 | 0.557103 | 1.753496 | 1164 |
| 8 | Independence Day | 0.790514 | 0.359119 | 0.237594 | 0.381445 | 0.403981 | 0.000000 | 715 |
| 4 | Easter | 0.395257 | 0.258034 | 0.273457 | 0.292896 | 0.373031 | 0.000000 | 621 |
| 6 | Memorial Day | 0.000000 | 0.308576 | 0.336217 | 0.274164 | 0.353484 | 0.000000 | 644 |
| 1 | Martin Luther King's Day | 0.395257 | 0.329857 | 0.201730 | 0.200940 | 0.302986 | 1.123107 | 674 |
| 7 | Juneteenth National Independence Day | 0.000000 | 0.236753 | 0.248801 | 0.315033 | 0.301357 | 0.000000 | 570 |
| 5 | Easter Monday | 0.000000 | 0.258034 | 0.273457 | 0.279273 | 0.293212 | 0.000000 | 563 |
| 2 | Washington's Birthday | 0.000000 | 0.170249 | 0.190523 | 0.224780 | 0.291583 | 1.195566 | 625 |
| 9 | Labor Day | 0.000000 | 0.385720 | 0.322769 | 0.352496 | 0.280180 | 0.000000 | 668 |
| 13 | Black Friday | 0.790514 | 0.204831 | 0.221903 | 0.267352 | 0.276923 | 0.000000 | 505 |
| 11 | Veterans' Day | 0.395257 | 0.297936 | 0.253284 | 0.269055 | 0.267149 | 0.000000 | 548 |
| 10 | Columbus Day | 0.000000 | 0.337838 | 0.255525 | 0.289490 | 0.260633 | 0.000000 | 571 |
| 3 | Saint Patrick's Day | 0.790514 | 0.242073 | 0.260008 | 0.245215 | 0.244343 | 1.289762 | 681 |
| 12 | Thanksgiving Day | 0.395257 | 0.199511 | 0.208455 | 0.253729 | 0.216651 | 0.000000 | 451 |
| 14 | Christmas Day | 1.185771 | 0.180890 | 0.293630 | 0.296301 | 0.193846 | 0.000000 | 495 |
We visualize the data using a bar plot:
fig = px.bar(
holidays_df.drop(holidays_df.index[0]),
x='holiday',
y=['n_incidents_2013', 'n_incidents_2014', 'n_incidents_2015', 'n_incidents_2016', 'n_incidents_2017', 'n_incidents_2018'],
title='Number of incidents per holiday',
labels={'holiday': 'Holiday', 'value': 'Number of incidents', 'variable': 'Year'},
barmode='group',
)
fig.show()
pyo.plot(fig, filename='../html/incidents_per_holiday.html', auto_open=False);
We notice that the ranking of holidays by number of incidents remains fairly consistent over the years, as expected given the similar day-by-day distribution of incidents across years (2013 and 2018 are only partially covered by the dataset, hence the many zero counts for those years).
New Year's Day, followed by Independence Day, is the holiday with the highest number of incidents, while Thanksgiving and Christmas are the holidays with the lowest.
We check whether the values of the attribute state are admissible by comparing them with an official list of US states:
usa_states_df = pd.read_csv(
'https://www2.census.gov/geo/docs/reference/state.txt',
sep='|',
dtype={'STATE': str, 'STATE_NAME': str}
)
usa_name_alphcode = usa_states_df.set_index('STATE_NAME').to_dict()['STUSAB']
states = incidents_df['state'].unique()
not_existing_states = False
missing_states = False
for state in states:
if state not in usa_name_alphcode:
not_existing_states = True
print(f"State {state} does not exist")
for state in usa_name_alphcode:
if state not in states:
missing_states = True
print(f"State {state} is missing")
if not not_existing_states:
    print("All the values of the attribute 'state' are actually USA states (there are no misspellings or other errors).")
if not missing_states:
    print("There is at least one incident for each USA state.")
State American Samoa is missing
State Guam is missing
State Northern Mariana Islands is missing
State Puerto Rico is missing
State U.S. Minor Outlying Islands is missing
State U.S. Virgin Islands is missing
All the values of the attribute 'state' are actually USA states (there are no misspellings or other errors).
We now check whether, given a certain value for the attributes latitude and longitude, the attribute city_or_county always takes the same value:
incidents_df.groupby(['latitude', 'longitude'])['city_or_county'].unique()[lambda x: x.str.len() > 1].to_frame()
| city_or_county | ||
|---|---|---|
| latitude | longitude | |
| 19.7009 | -155.0880 | [Hilo, Hawaii (county)] |
| 24.5611 | -81.7732 | [Key West, Key West (Stock Island)] |
| 24.7186 | -81.0668 | [Marathon (Grassy Key), Marathon] |
| 24.7250 | -81.0470 | [Marathon (Duck Key), Marathon] |
| 25.1326 | -80.4124 | [Key Largo , Key Largo] |
| ... | ... | ... |
| 59.8040 | -151.1450 | [Homer, Homer (Fritz Creek)] |
| 60.5844 | -151.2350 | [Nikiski, Kenai] |
| 60.7854 | -161.8010 | [Bethel (Newtok), Newtok, Bethel (Napaskiak), Bethel] |
| 61.5803 | -149.6110 | [Wasilla, Big Lake] |
| 63.3547 | -143.3610 | [Tok, Tanacross, Tok (Mentasta Lake)] |
1518 rows × 1 columns
That is not the case. Sometimes the attribute city_or_county holds the name of the city, other times the name of the county (as in the first row displayed above). Furthermore, even when the attribute refers to the same place, it can be written in different ways (e.g. "Bethel (Newtok)", "Bethel (Napaskiak)" and "Bethel").
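One way to mitigate this kind of inconsistency (a sketch only; the cleaning performed later actually relies on GeoPy) is to strip the trailing parenthetical before comparing place names:

```python
import re

def normalize_place(name):
    """Strip a trailing parenthetical and surrounding whitespace,
    e.g. 'Bethel (Newtok)' -> 'Bethel'."""
    return re.sub(r'\s*\([^)]*\)\s*$', '', name).strip()

print(normalize_place('Bethel (Newtok)'))        # Bethel
print(normalize_place('Marathon (Grassy Key)'))  # Marathon
print(normalize_place('Key Largo '))             # Key Largo
```

After normalization, variants such as "Bethel (Newtok)" and "Bethel (Napaskiak)" collapse to the same value, though it cannot resolve city-vs-county ambiguities like "Hilo" vs "Hawaii (county)".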
We now check if a similar problem occurs for the attribute address:
incidents_df.groupby(['latitude', 'longitude'])['address'].unique()[lambda x: x.str.len() > 1].to_frame()
| address | ||
|---|---|---|
| latitude | longitude | |
| 19.6950 | -155.066 | [2100 Kanoelehua Ave, 2100 Kanoelehua Avenue] |
| 19.7052 | -155.062 | [2450 Kekuanaoa Street, 2450 Kekuanaoa St.] |
| 19.7297 | -155.090 | [33rd Avenue, Kamehameha Highway, Kilauea Avenue extension] |
| 21.2766 | -157.687 | [8102 Kalanianaole Highway, 8102 Kalanianaole Hwy] |
| 21.3061 | -157.861 | [nan, 1 Keahole Place] |
| ... | ... | ... |
| 64.8580 | -147.701 | [423 Merhar Avenue, 425 Merhar Ave] |
| 64.8599 | -147.741 | [1215 Bunnell St, 1215 Bunnell Street] |
| 64.8621 | -147.751 | [College Road, 2600 College Rd] |
| 64.8636 | -147.759 | [1800 College Road, 1800 College Rd] |
| 64.8893 | -147.759 | [Farmer's Loop, Farmers Loop Road, Farmers Loop] |
12315 rows × 1 columns
This attribute, too, can be written in different ways (e.g. "Avenue" may be abbreviated as "Ave", "Highway" as "Hwy"). There could also be errors (e.g. the same point corresponds to the three distinct addresses "33rd Avenue", "Kamehameha Highway" and "Kilauea Avenue extension").
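For comparison purposes, such abbreviations could be expanded token by token; a minimal sketch with a hypothetical (and deliberately incomplete) abbreviation map:

```python
# Hypothetical abbreviation map; a real one would need to be far more complete
ABBREVIATIONS = {'Ave': 'Avenue', 'Ave.': 'Avenue',
                 'St': 'Street', 'St.': 'Street',
                 'Rd': 'Road', 'Rd.': 'Road',
                 'Hwy': 'Highway', 'Hwy.': 'Highway'}

def expand_address(address):
    """Expand common street-type abbreviations token by token."""
    return ' '.join(ABBREVIATIONS.get(token, token) for token in address.split())

print(expand_address('2100 Kanoelehua Ave'))    # 2100 Kanoelehua Avenue
print(expand_address('8102 Kalanianaole Hwy'))  # 8102 Kalanianaole Highway
```

Note that a real map needs care: "St" can stand for both "Street" and "Saint" depending on its position in the address.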
We plot on a map the location of the incidents:
fig = px.scatter_mapbox(
lat=incidents_df['latitude'],
lon=incidents_df['longitude'],
zoom=0,
height=500,
width=800
)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
There are some points in China that are clearly wrong. We display the rows of the dataset that correspond to one of these points:
incidents_df[(incidents_df['latitude'] == 37.6499) & (incidents_df['longitude'] == 97.4331)]
| date | state | city_or_county | address | latitude | longitude | congressional_district | state_house_district | state_senate_district | participant_age1 | participant_age_group1 | participant_gender1 | min_age_participants | avg_age_participants | max_age_participants | n_participants_child | n_participants_teen | n_participants_adult | n_males | n_females | n_killed | n_injured | n_arrested | n_unharmed | n_participants | notes | incident_characteristics1 | incident_characteristics2 | year | date_original | month | month_name | day | day_of_week | day_of_week_name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30351 | 2018-03-29 | Kansas | Wichita | NaN | 37.6499 | 97.4331 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | NaN | NaN | 0.0 | NaN | Non-Shooting Incident | TSA Action | 2018 | 2018-03-29 | 3 | March | 29 | 3 | Thursday |
| 102739 | 2018-03-25 | Kansas | Wichita | NaN | 37.6499 | 97.4331 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | NaN | NaN | 0.0 | NaN | Non-Shooting Incident | TSA Action | 2018 | 2018-03-25 | 3 | March | 25 | 6 | Sunday |
That point probably has the correct values for the attributes state and city_or_county: Wichita, Kansas lies at roughly latitude 37.65 and longitude -97.43, so the sign of the longitude was most likely flipped.
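A simple sanity check (a sketch only, not the approach we ultimately used) could flag such sign errors with an approximate bounding box of the contiguous United States:

```python
def in_conus_bbox(lat, lon):
    """Rough bounding box of the contiguous United States
    (approximate; excludes Alaska and Hawaii)."""
    return 24.0 <= lat <= 50.0 and -125.0 <= lon <= -66.0

lat, lon = 37.6499, 97.4331  # the suspicious Wichita point
if not in_conus_bbox(lat, lon) and in_conus_bbox(lat, -lon):
    lon = -lon  # the longitude sign was probably flipped
print(lat, lon)  # 37.6499 -97.4331
```

Such a heuristic would miss Alaska and Hawaii, which is one reason a proper reverse-geocoding check is preferable.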
To fix these inconsistencies we used the library GeoPy. This library allows retrieving the address (state, county, suburb, city, town, village, location name, and other features) corresponding to a given latitude and longitude. We queried the library with all the latitudes and longitudes of the points in the dataset and saved the results in the CSV file we now load:
geopy_path = os.path.join(DATA_FOLDER_PATH, 'external_data/geopy.csv')
geopy_df = pd.read_csv(geopy_path, index_col=['index'], low_memory=False, dtype={})
geopy_df.sample(2, random_state=1)
| importance_geopy | addresstype_geopy | address_geopy | coord_presence | city_geopy | county_geopy | state_geopy | suburb_geopy | village_geopy | town_geopy | |
|---|---|---|---|---|---|---|---|---|---|---|
| index | ||||||||||
| 8141 | 0.000010 | building | 2706, South Lawndale Avenue, Little Village, South Lawndale, Chicago, West Chicago Township, Cook County, Illinois, 60623, United States | True | Chicago | Cook County | Illinois | NaN | NaN | NaN |
| 64095 | 0.166904 | historic | King David Kalakaua Building, 335, Merchant Street, Downtown, Hawaii Capital Historic District, Honolulu, Honolulu County, Hawaii, 96813, United States | True | Honolulu | Honolulu County | Hawaii | Hawaii Capital Historic District | NaN | NaN |
The rows in this dataframe correspond to the rows in the original dataset. Its column coord_presence is false if the corresponding row in the original dataset did not have latitude and longitude values.
Among all the attributes returned by GeoPy, we selected and used the following; below we inspect their presence and values:
print(f"Number of rows in which suburb is null: {geopy_df.loc[geopy_df['suburb_geopy'].isna()].shape[0]}\n")
print('Coordinate presence:')
display(geopy_df['coord_presence'].value_counts())
print('Importance presence:')
display(geopy_df['importance_geopy'].notna().value_counts())
print(f"Number of rows in which city is null and town is not null: {geopy_df[(geopy_df['city_geopy'].isnull()) & (geopy_df['town_geopy'].notnull())].shape[0]}\n")
print("Values of addresstype:")
print(geopy_df['addresstype_geopy'].unique())
print(f"\nNumber of rows in which addresstype is null: {geopy_df[geopy_df['addresstype_geopy'].isnull()].shape[0]}")
Number of rows in which suburb is null: 213191

Coordinate presence:
True     231754
False      7923
Name: coord_presence, dtype: int64

Importance presence:
True     231754
False      7923
Name: importance_geopy, dtype: int64

Number of rows in which city is null and town is not null: 37534

Values of addresstype:
['road' 'place' 'building' 'amenity' nan 'aeroway' 'tourism' 'leisure'
 'county' 'shop' 'historic' 'office' 'man_made' 'craft' 'highway' 'city'
 'landuse' 'emergency' 'hamlet' 'railway' 'suburb' 'club' 'municipality'
 'junction' 'village' 'military' 'healthcare' 'locality' 'town'
 'neighbourhood' 'state' 'natural' 'city_district']

Number of rows in which addresstype is null: 7923
We also downloaded from Wikipedia the list of counties (or their equivalents) in each state.
This data was used when the incidents data did not match the GeoPy data and, when latitude and longitude were not available, to check whether the county actually belonged to the state.
counties_path = os.path.join(DATA_FOLDER_PATH, 'external_data/counties.csv')
counties_df = pd.read_csv(counties_path)
counties_df.head()
| County or equivalent | State or equivalent | Population (2020 census) | Area (mi^2) | Core-based statistical area | |
|---|---|---|---|---|---|
| 0 | Autauga | Alabama | 58,805 | 594.44 | Montgomery, AL Metropolitan Statistical Area |
| 1 | Baldwin | Alabama | 231,767 | 1589.78 | Daphne-Fairhope-Foley, AL Metropolitan Statistical Area |
| 2 | Barbour | Alabama | 25,223 | 884.88 | Eufaula, AL-GA Micropolitan Statistical Area |
| 3 | Bibb | Alabama | 22,293 | 622.58 | Birmingham-Hoover, AL Metropolitan Statistical Area |
| 4 | Blount | Alabama | 59,134 | 644.78 | Birmingham-Hoover, AL Metropolitan Statistical Area |
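A membership check against this data can be sketched as follows, using a hypothetical subset of counties_df (the real check lives in check_geographical_data_consistency, used below):

```python
import pandas as pd

# Hypothetical stand-in for a few rows of counties_df
counties_df_sample = pd.DataFrame({
    'County or equivalent': ['Autauga', 'Baldwin', 'Cook'],
    'State or equivalent': ['Alabama', 'Alabama', 'Illinois'],
})

# Build a state -> set-of-counties lookup table
state_counties = (counties_df_sample
                  .groupby('State or equivalent')['County or equivalent']
                  .agg(set).to_dict())

def county_in_state(county, state):
    """True if the county belongs to the given state."""
    return county in state_counties.get(state, set())

print(county_in_state('Baldwin', 'Alabama'))  # True
print(county_in_state('Cook', 'Alabama'))     # False
```

The set-based lookup makes each check O(1) after a single pass over the table.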
We now check and correct the consistency of the geographic data:
from TASK_1.data_preparation_utils import check_geographical_data_consistency
if LOAD_DATA_FROM_CHECKPOINT:
incidents_df = load_checkpoint('checkpoint_1', date_cols=['date', 'date_original'])
else:
geo_df = incidents_df[['state', 'city_or_county', 'address', 'latitude', 'longitude']]
geo_df = pd.concat([geo_df, geopy_df.loc[incidents_df.index]], axis=1)
geo_df = geo_df.apply(lambda row: check_geographical_data_consistency(row, additional_data=counties_df), axis=1)
incidents_df[geo_df.columns] = geo_df[geo_df.columns]
save_checkpoint(incidents_df, 'checkpoint_1')
The function called above performs the following operations:
When latitude and longitude are available (and therefore GeoPy provided information for the corresponding location), it compares the state and city_or_county values in our dataset with the state, county and city returned by GeoPy.
If these comparisons fail, it checks for potential typos in the strings. This is done using the Damerau-Levenshtein distance (defined below), with a threshold deciding the maximum distance at which two strings are still considered equal. The thresholds were set after several preliminary tests; we use different thresholds for state and for city or county.
The Damerau-Levenshtein distance between two strings $s$ and $t$ is defined as\
$D(i, j) = \min \begin{cases} D(i-1, j) + 1 & \text{(deletion)} \\ D(i, j-1) + 1 & \text{(insertion)} \\ D(i-1, j-1) + \delta & \text{(substitution)} \\ D(i-2, j-2) + 1 & \text{if } s[i] = t[j-1] \text{ and } s[i-1] = t[j] \quad \text{(transposition)} \end{cases}$
where $\delta = 0$ if $s[i] = t[j]$ and $\delta = 1$ otherwise, with base cases $D(i, 0) = i$ and $D(0, j) = j$.
If the comparison still fails, it compares the address field of our dataset with GeoPy's display_name (again using the Damerau-Levenshtein distance).
In case of a match with the GeoPy data, we set the fields state, county, city, latitude, longitude, importance and address_type to the corresponding values provided by GeoPy. Otherwise we check for matches with the Wikipedia data (again using the Damerau-Levenshtein distance).
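A minimal self-contained sketch of this distance (the restricted, "optimal string alignment" variant, which matches the recurrence given above):

```python
def damerau_levenshtein(s, t):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i                      # base case D(i, 0) = i
    for j in range(len(t) + 1):
        d[0][j] = j                      # base case D(0, j) = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            delta = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + delta  # substitution (or match)
            )
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(s)][len(t)]

print(damerau_levenshtein('Wichita', 'Wihcita'))  # 1 (one transposition)
print(damerau_levenshtein('kitten', 'sitting'))   # 3
```

Unlike plain Levenshtein, a swap of adjacent characters ("Wihcita") costs 1 rather than 2, which is exactly the kind of typo we want the threshold to tolerate.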
tot_row = incidents_df.index.size
n_all_null = incidents_df.isnull().all(axis=1).sum()
print('Number of rows with all null values: ', n_all_null, ' / ', n_all_null*100/tot_row, '%')
for col in ['state', 'county', 'city', 'latitude', 'longitude']:
    n_null = incidents_df[col].isnull().sum()
    print(f'Number of rows with null value for {col}: ', n_null, ' / ', n_null*100/tot_row, '%')
Number of rows with all null values:  0  /  0.0 %
Number of rows with null value for state:  0  /  0.0 %
Number of rows with null value for county:  20914  /  8.736700072269729 %
Number of rows with null value for city:  35794  /  14.952732255275064 %
Number of rows with null value for latitude:  12796  /  5.345453482105931 %
Number of rows with null value for longitude:  12796  /  5.345453482105931 %
sns.heatmap(incidents_df.isnull(), cbar=False, xticklabels=True);
Now all the entries in the dataset have a non-null and consistent value for state. Only 12,796 data points, which account for 5.34% of the dataset, are left without valid latitude and longitude values (their original values, when present, were inconsistent and therefore discarded).
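The per-column null percentages printed above can also be obtained in a single expression with isna().mean(); a sketch on a toy frame standing in for incidents_df:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for incidents_df
df = pd.DataFrame({'state': ['Kansas', 'Texas', 'Ohio', 'Utah'],
                   'county': ['Sedgwick', None, None, 'Salt Lake'],
                   'latitude': [37.6, np.nan, 40.0, 40.7]})

# isna() gives a boolean mask; mean() over each column is the null fraction
null_pct = df.isna().mean() * 100
print(null_pct)
```

This yields one percentage per column without repeating the count-and-divide boilerplate.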
incidents_df.groupby(['state_consistency','county_consistency','address_consistency']).size().to_frame().rename(columns={0:'count'}).sort_index(ascending=False)
| count | |||
|---|---|---|---|
| state_consistency | county_consistency | address_consistency | |
| 1 | 1 | 1 | 181794 |
| 0 | 13201 | ||
| -1 | 4956 | ||
| -1 | 1 | 26621 | |
| 0 | 2912 | ||
| -1 | 1887 | ||
| -1 | 1 | 1 | 13 |
| 0 | 4 | ||
| -1 | 2 | ||
| -1 | 1 | 53 | |
| 0 | 11 | ||
| -1 | 7927 |
stats = {}
stats_columns = ['#null_val', '#not_null', '#value_count']
for col in ['state', 'county', 'city', 'latitude', 'longitude', 'state_consistency',
'county_consistency', 'address_consistency', 'location_importance', 'address_type']:
stats[col] = []
stats[col].append(incidents_df[col].isna().sum())
stats[col].append(len(incidents_df[col]) - incidents_df[col].isna().sum())
stats[col].append(len(incidents_df[col].value_counts()))
clean_geo_stat_stats = pd.DataFrame(stats, index=stats_columns).transpose()
clean_geo_stat_stats
| #null_val | #not_null | #value_count | |
|---|---|---|---|
| state | 0 | 239381 | 51 |
| county | 20914 | 218467 | 1955 |
| city | 35794 | 203587 | 8302 |
| latitude | 12796 | 226585 | 99826 |
| longitude | 12796 | 226585 | 110433 |
| state_consistency | 0 | 239381 | 2 |
| county_consistency | 0 | 239381 | 2 |
| address_consistency | 0 | 239381 | 3 |
| location_importance | 12796 | 226585 | 709 |
| address_type | 12796 | 226585 | 32 |
incidents_df[['latitude', 'county', 'city']].isna().groupby(['latitude', 'county', 'city']).size().to_frame().rename(columns={0:'count'})
| count | |||
|---|---|---|---|
| latitude | county | city | |
| False | False | False | 193710 |
| True | 22995 | ||
| True | False | 9877 | |
| True | 3 | ||
| True | False | True | 1762 |
| True | True | 11034 |
incidents_df[['latitude']].isna().groupby(['latitude']).size().to_frame().rename(columns={0:'count'})
| count | |
|---|---|
| latitude | |
| False | 226585 |
| True | 12796 |
incidents_df_not_null = incidents_df[incidents_df['latitude'].notna()]
print('Number of entries with not null values for latitude and longitude: ', len(incidents_df_not_null))
plot_scattermap_plotly(incidents_df_not_null, 'state', zoom=2, title='Incidents distribution by state')
Number of entries with not null values for latitude and longitude: 226585
incidents_df_nan_county = incidents_df.loc[(incidents_df['latitude'].notna()) & (incidents_df['county'].isna()) &
    (incidents_df['city'].notna())]
print('Number of entries with non-null lat/lon and city but null county: ', len(incidents_df_nan_county))
plot_scattermap_plotly(incidents_df_nan_county, 'state', zoom=2, title='Missing county')
Number of entries with non-null lat/lon and city but null county:  9877
incidents_df.loc[(incidents_df['latitude'].notna()) & (incidents_df['county'].isna()) & (incidents_df['city'].notna())].groupby('city').count()
| date | state | city_or_county | address | latitude | longitude | congressional_district | state_house_district | state_senate_district | participant_age1 | participant_age_group1 | participant_gender1 | min_age_participants | avg_age_participants | max_age_participants | n_participants_child | n_participants_teen | n_participants_adult | n_males | n_females | n_killed | n_injured | n_arrested | n_unharmed | n_participants | notes | incident_characteristics1 | incident_characteristics2 | year | date_original | month | month_name | day | day_of_week | day_of_week_name | county | state_consistency | county_consistency | address_consistency | location_importance | address_type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| city | |||||||||||||||||||||||||||||||||||||||||
| Alexandria | 39 | 41 | 41 | 41 | 41 | 41 | 41 | 41 | 41 | 27 | 34 | 35 | 29 | 30 | 28 | 34 | 34 | 34 | 35 | 35 | 41 | 41 | 36 | 36 | 41 | 32 | 41 | 20 | 39 | 41 | 41 | 41 | 41 | 41 | 41 | 0 | 41 | 41 | 41 | 41 | 41 |
| Aurora | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| Baltimore | 1567 | 1856 | 1856 | 1856 | 1856 | 1856 | 1797 | 1631 | 1797 | 1371 | 1792 | 1795 | 1437 | 1418 | 1427 | 1792 | 1792 | 1792 | 1795 | 1795 | 1856 | 1856 | 1821 | 1821 | 1856 | 591 | 1855 | 612 | 1567 | 1856 | 1856 | 1856 | 1856 | 1856 | 1856 | 0 | 1856 | 1856 | 1856 | 1856 | 1856 |
| Brooklyn Park | 35 | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 31 | 35 | 34 | 31 | 32 | 31 | 35 | 35 | 35 | 34 | 34 | 36 | 36 | 35 | 35 | 36 | 8 | 36 | 11 | 35 | 36 | 36 | 36 | 36 | 36 | 36 | 0 | 36 | 36 | 36 | 36 | 36 |
| Broomfield | 6 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 3 | 6 | 6 | 4 | 4 | 3 | 6 | 6 | 6 | 6 | 6 | 7 | 7 | 7 | 7 | 7 | 2 | 7 | 5 | 6 | 7 | 7 | 7 | 7 | 7 | 7 | 0 | 7 | 7 | 7 | 7 | 7 |
| Carson City | 25 | 27 | 27 | 27 | 27 | 27 | 26 | 26 | 26 | 21 | 24 | 24 | 22 | 22 | 22 | 24 | 24 | 24 | 24 | 24 | 27 | 27 | 23 | 23 | 27 | 13 | 27 | 23 | 25 | 27 | 27 | 27 | 27 | 27 | 27 | 0 | 27 | 27 | 27 | 27 | 27 |
| Charlottesville | 57 | 58 | 58 | 58 | 58 | 58 | 58 | 58 | 58 | 38 | 44 | 45 | 43 | 40 | 41 | 44 | 44 | 44 | 45 | 45 | 58 | 58 | 46 | 46 | 58 | 56 | 58 | 28 | 57 | 58 | 58 | 58 | 58 | 58 | 58 | 0 | 58 | 58 | 58 | 58 | 58 |
| Chesapeake | 87 | 112 | 112 | 109 | 112 | 112 | 112 | 112 | 112 | 52 | 97 | 99 | 63 | 60 | 63 | 97 | 97 | 97 | 99 | 99 | 112 | 112 | 106 | 106 | 112 | 95 | 111 | 41 | 87 | 112 | 112 | 112 | 112 | 112 | 112 | 0 | 112 | 112 | 112 | 112 | 112 |
| Colonial Heights | 10 | 13 | 13 | 12 | 13 | 13 | 13 | 13 | 13 | 7 | 11 | 11 | 8 | 7 | 8 | 11 | 11 | 11 | 11 | 11 | 13 | 13 | 11 | 11 | 13 | 11 | 13 | 9 | 10 | 13 | 13 | 13 | 13 | 13 | 13 | 0 | 13 | 13 | 13 | 13 | 13 |
| Covington | 5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 2 | 3 | 3 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 6 | 6 | 3 | 3 | 6 | 3 | 6 | 4 | 5 | 6 | 6 | 6 | 6 | 6 | 6 | 0 | 6 | 6 | 6 | 6 | 6 |
| Danville | 116 | 131 | 131 | 131 | 131 | 131 | 131 | 131 | 130 | 89 | 111 | 110 | 92 | 92 | 89 | 111 | 111 | 111 | 110 | 110 | 131 | 131 | 115 | 115 | 131 | 120 | 131 | 59 | 116 | 131 | 131 | 131 | 131 | 131 | 131 | 0 | 131 | 131 | 131 | 131 | 131 |
| Denver | 553 | 578 | 578 | 574 | 578 | 578 | 572 | 572 | 572 | 166 | 289 | 339 | 193 | 210 | 200 | 289 | 289 | 289 | 339 | 339 | 578 | 578 | 412 | 412 | 578 | 255 | 577 | 279 | 553 | 578 | 578 | 578 | 578 | 578 | 578 | 0 | 578 | 578 | 578 | 578 | 578 |
| Falls Church | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 0 | 2 | 2 | 2 | 0 | 0 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 1 | 3 | 1 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 0 | 3 | 3 | 3 | 3 | 3 |
| Franklin | 13 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 9 | 12 | 13 | 9 | 10 | 10 | 12 | 12 | 12 | 13 | 13 | 15 | 15 | 14 | 14 | 15 | 12 | 15 | 6 | 13 | 15 | 15 | 15 | 15 | 15 | 15 | 0 | 15 | 15 | 15 | 15 | 15 |
| Fredericksburg | 29 | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 18 | 19 | 19 | 18 | 18 | 18 | 19 | 19 | 19 | 19 | 19 | 30 | 30 | 19 | 19 | 30 | 28 | 30 | 12 | 29 | 30 | 30 | 30 | 30 | 30 | 30 | 0 | 30 | 30 | 30 | 30 | 30 |
| Galax | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| Hampton | 194 | 208 | 208 | 208 | 208 | 208 | 205 | 205 | 205 | 144 | 183 | 184 | 154 | 154 | 153 | 183 | 183 | 183 | 184 | 184 | 208 | 208 | 194 | 194 | 208 | 182 | 207 | 89 | 194 | 208 | 208 | 208 | 208 | 208 | 208 | 0 | 208 | 208 | 208 | 208 | 208 |
| Harrisonburg | 12 | 12 | 12 | 11 | 12 | 12 | 12 | 12 | 12 | 7 | 9 | 8 | 8 | 8 | 8 | 9 | 9 | 9 | 8 | 8 | 12 | 12 | 10 | 10 | 12 | 9 | 12 | 9 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 0 | 12 | 12 | 12 | 12 | 12 |
| Hopewell | 42 | 46 | 46 | 46 | 46 | 46 | 45 | 45 | 45 | 34 | 43 | 44 | 37 | 37 | 38 | 43 | 43 | 43 | 44 | 44 | 46 | 46 | 46 | 46 | 46 | 39 | 46 | 18 | 42 | 46 | 46 | 46 | 46 | 46 | 46 | 0 | 46 | 46 | 46 | 46 | 46 |
| Juneau | 90 | 94 | 94 | 83 | 94 | 94 | 91 | 91 | 0 | 33 | 41 | 39 | 33 | 34 | 35 | 41 | 41 | 41 | 39 | 39 | 94 | 94 | 41 | 41 | 94 | 86 | 94 | 76 | 90 | 94 | 94 | 94 | 94 | 94 | 94 | 0 | 94 | 94 | 94 | 94 | 94 |
| Lexington | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| Lynchburg | 84 | 89 | 89 | 85 | 89 | 89 | 88 | 88 | 88 | 65 | 79 | 79 | 68 | 69 | 68 | 79 | 79 | 79 | 79 | 79 | 89 | 89 | 82 | 82 | 89 | 78 | 86 | 45 | 84 | 89 | 89 | 89 | 89 | 89 | 89 | 0 | 89 | 89 | 89 | 89 | 89 |
| Manassas | 16 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 9 | 11 | 10 | 10 | 9 | 9 | 11 | 11 | 11 | 10 | 10 | 17 | 17 | 11 | 11 | 17 | 12 | 17 | 7 | 16 | 17 | 17 | 17 | 17 | 17 | 17 | 0 | 17 | 17 | 17 | 17 | 17 |
| Manassas Park | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| Martinsville | 13 | 13 | 13 | 12 | 13 | 13 | 13 | 13 | 13 | 9 | 11 | 11 | 10 | 10 | 10 | 11 | 11 | 11 | 11 | 11 | 13 | 13 | 11 | 11 | 13 | 11 | 13 | 4 | 13 | 13 | 13 | 13 | 13 | 13 | 13 | 0 | 13 | 13 | 13 | 13 | 13 |
| Norfolk | 459 | 546 | 546 | 520 | 546 | 546 | 544 | 544 | 544 | 282 | 437 | 444 | 302 | 314 | 316 | 437 | 437 | 437 | 444 | 444 | 546 | 546 | 497 | 497 | 546 | 485 | 544 | 175 | 459 | 546 | 546 | 546 | 546 | 546 | 546 | 0 | 546 | 546 | 546 | 546 | 546 |
| Norton | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 0 | 3 | 3 | 3 | 3 | 3 |
| Philadelphia | 110 | 119 | 119 | 74 | 119 | 119 | 118 | 3 | 3 | 74 | 90 | 90 | 75 | 76 | 76 | 90 | 90 | 90 | 90 | 90 | 119 | 119 | 94 | 94 | 119 | 73 | 119 | 53 | 110 | 119 | 119 | 119 | 119 | 119 | 119 | 0 | 119 | 119 | 119 | 119 | 119 |
| Portsmouth | 143 | 182 | 182 | 163 | 182 | 182 | 182 | 182 | 182 | 59 | 152 | 158 | 93 | 86 | 85 | 152 | 152 | 152 | 158 | 158 | 182 | 182 | 173 | 173 | 182 | 157 | 182 | 53 | 143 | 182 | 182 | 182 | 182 | 182 | 182 | 0 | 182 | 182 | 182 | 182 | 182 |
| Richmond | 236 | 303 | 303 | 278 | 303 | 303 | 301 | 301 | 301 | 118 | 269 | 273 | 154 | 159 | 161 | 269 | 269 | 269 | 273 | 273 | 303 | 303 | 286 | 286 | 303 | 272 | 300 | 99 | 236 | 303 | 303 | 303 | 303 | 303 | 303 | 0 | 303 | 303 | 303 | 303 | 303 |
| Roanoke | 67 | 73 | 73 | 64 | 73 | 73 | 73 | 73 | 73 | 46 | 57 | 59 | 48 | 47 | 47 | 57 | 57 | 57 | 59 | 59 | 73 | 73 | 62 | 62 | 73 | 69 | 72 | 40 | 67 | 73 | 73 | 73 | 73 | 73 | 73 | 0 | 73 | 73 | 73 | 73 | 73 |
| Saint Louis | 1994 | 2074 | 2074 | 2004 | 2074 | 2074 | 2056 | 2056 | 2056 | 1441 | 1767 | 1929 | 1536 | 1558 | 1544 | 1767 | 1767 | 1767 | 1929 | 1929 | 2074 | 2074 | 2013 | 2013 | 2074 | 1347 | 2072 | 961 | 1994 | 2074 | 2074 | 2074 | 2074 | 2074 | 2074 | 0 | 2074 | 2074 | 2074 | 2074 | 2074 |
| Salem | 5 | 5 | 5 | 4 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 3 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 0 | 5 | 5 | 5 | 5 | 5 |
| San Francisco | 318 | 378 | 378 | 364 | 378 | 378 | 377 | 377 | 377 | 220 | 331 | 326 | 240 | 245 | 239 | 331 | 331 | 331 | 326 | 326 | 378 | 378 | 353 | 353 | 378 | 199 | 378 | 183 | 318 | 378 | 378 | 378 | 378 | 378 | 378 | 0 | 378 | 378 | 378 | 378 | 378 |
| Staunton | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 8 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 0 | 10 | 10 | 10 | 10 | 10 |
| Suffolk | 14 | 16 | 16 | 12 | 16 | 16 | 16 | 16 | 16 | 7 | 13 | 13 | 8 | 9 | 10 | 13 | 13 | 13 | 13 | 13 | 16 | 16 | 15 | 15 | 16 | 15 | 16 | 8 | 14 | 16 | 16 | 16 | 16 | 16 | 16 | 0 | 16 | 16 | 16 | 16 | 16 |
| Virginia Beach | 192 | 246 | 246 | 244 | 246 | 246 | 246 | 246 | 246 | 89 | 194 | 196 | 105 | 114 | 107 | 194 | 194 | 194 | 196 | 196 | 246 | 246 | 222 | 222 | 246 | 227 | 246 | 115 | 192 | 246 | 246 | 246 | 246 | 246 | 246 | 0 | 246 | 246 | 246 | 246 | 246 |
| Washington | 2002 | 2493 | 2493 | 2357 | 2493 | 2493 | 2129 | 0 | 0 | 462 | 1464 | 1315 | 580 | 589 | 569 | 1464 | 1464 | 1464 | 1315 | 1315 | 2493 | 2493 | 1706 | 1706 | 2493 | 573 | 2490 | 988 | 2002 | 2493 | 2493 | 2493 | 2493 | 2493 | 2493 | 0 | 2493 | 2493 | 2493 | 2493 | 2493 |
| Waynesboro | 24 | 25 | 25 | 24 | 25 | 25 | 25 | 25 | 25 | 21 | 23 | 23 | 21 | 22 | 21 | 23 | 23 | 23 | 23 | 23 | 25 | 25 | 23 | 23 | 25 | 25 | 25 | 19 | 24 | 25 | 25 | 25 | 25 | 25 | 25 | 0 | 25 | 25 | 25 | 25 | 25 |
| Winchester | 8 | 8 | 8 | 6 | 8 | 8 | 8 | 8 | 8 | 6 | 8 | 8 | 8 | 8 | 6 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 7 | 8 | 6 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 0 | 8 | 8 | 8 | 8 | 8 |
incidents_df[(incidents_df['latitude'].notna()) & (incidents_df['city'].isna()) & (incidents_df['county'].isna())]
| date | state | city_or_county | address | latitude | longitude | congressional_district | state_house_district | state_senate_district | participant_age1 | participant_age_group1 | participant_gender1 | min_age_participants | avg_age_participants | max_age_participants | n_participants_child | n_participants_teen | n_participants_adult | n_males | n_females | n_killed | n_injured | n_arrested | n_unharmed | n_participants | notes | incident_characteristics1 | incident_characteristics2 | year | date_original | month | month_name | day | day_of_week | day_of_week_name | county | city | state_consistency | county_consistency | address_consistency | location_importance | address_type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39689 | 2014-06-23 | MICHIGAN | Traverse City | 727 Fly Don't Dr | 44.7881 | -85.5539 | 1.0 | 104.0 | 37.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | NaN | NaN | 0.0 | Cherry Capital Airport | TSA Action | NaN | 2014.0 | 2014-06-23 | 6 | June | 23 | 0 | Monday | NaN | NaN | 1 | -1 | 1 | 0.737851 | state |
| 68656 | 2014-01-22 | MICHIGAN | Traverse City | 727 Fly Don't Dr | 44.7881 | -85.5539 | 1.0 | 104.0 | 37.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | NaN | NaN | 0.0 | Cherry Capital Airport | TSA Action | NaN | 2014.0 | 2014-01-22 | 1 | January | 22 | 2 | Wednesday | NaN | NaN | 1 | -1 | 1 | 0.737851 | state |
| 204688 | 2015-07-08 | MICHIGAN | Traverse City | 727 Fly Don't Dr | 44.7881 | -85.5539 | 1.0 | 104.0 | 37.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | NaN | NaN | 0.0 | NaN | Non-Shooting Incident | TSA Action | 2015.0 | 2015-07-08 | 7 | July | 8 | 2 | Wednesday | NaN | NaN | 1 | -1 | 1 | 0.737851 | state |
incidents_df_nan_city = incidents_df.loc[(incidents_df['latitude'].notna()) & (incidents_df['county'].notna()) & (incidents_df['city'].isna())]
print('Number of rows with null values for city, but not for lat/lon and county: ', len(incidents_df_nan_city))
plot_scattermap_plotly(incidents_df_nan_city, 'state', zoom=2, title='Missing city')
Number of rows with null values for city, but not for lat/lon and county: 22995
Final considerations
From this analysis we found that:
The dataset does not contain any other combinations beyond the ones just mentioned.
In the following we will recover the missing information using the latitude and longitude values.
To recover missing values of the attribute city when latitude and longitude are available, we will compute the closest city centroid and assign the corresponding city if the distance is below a certain threshold.
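The rule can be sketched on toy data. This is a minimal illustration only: the 5 km threshold and the haversine formula are stand-ins (below we use geopy's geodesic distance and derive per-city thresholds from the observed distance distributions), and the two centroids are hypothetical examples:

```python
import math

# Hypothetical centroids for illustration: city -> (lat, lon)
CENTROIDS = {
    'Prattville': (32.465774, -86.456941),
    'Daphne': (30.624720, -87.884390),
}

def haversine_km(p1, p2):
    """Great-circle distance in km (approximates the geodesic distance)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def closest_city(lat, lon, centroids=CENTROIDS, max_km=5.0):
    """Return the nearest centroid's city if it lies within max_km, else None."""
    best = min(centroids, key=lambda c: haversine_km((lat, lon), centroids[c]))
    return best if haversine_km((lat, lon), centroids[best]) <= max_km else None

print(closest_city(32.47, -86.46))  # Prattville
```

A point far from every centroid (e.g. `closest_city(40.0, -100.0)`) returns `None` and is left unassigned.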
First, we compute the centroid of each city and visualize the first 10 centroids in lexicographic order:
centroids = incidents_df.loc[incidents_df['latitude'].notna() & incidents_df['city'].notna()][[
'latitude', 'longitude', 'city', 'state', 'county']].groupby(['state', 'county', 'city']).mean()
centroids.head(10)
| latitude | longitude | |||
|---|---|---|---|---|
| state | county | city | ||
| ALABAMA | Autauga County | Autaugaville | 32.432700 | -86.656433 |
| Prattville | 32.465774 | -86.456941 | ||
| Baldwin County | Bay Minette | 30.870900 | -87.787700 | |
| Daphne | 30.624720 | -87.884390 | ||
| Elberta | 30.414333 | -87.588567 | ||
| Fairhope | 30.520640 | -87.880040 | ||
| Foley | 30.389367 | -87.690533 | ||
| Gulf Shores | 30.279338 | -87.696012 | ||
| Loxley | 30.670000 | -87.758700 | ||
| Orange Beach | 30.293400 | -87.589800 |
print('Number of distinct cities:', len(centroids.index.to_list()))
Number of distinct cities: 11271
For each tuple <state, county, city> in 'centroids', we extract the corresponding latitude and longitude coordinates from the 'incidents_df' DataFrame. We then compute the geodesic distance (in kilometers) between each point and its centroid, and record the percentiles of these distances (5th, 15th, ..., 95th), as well as the minimum, maximum, and average distance of the points in each city.
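The percentile bookkeeping used below can be checked on synthetic distances: `np.arange(0, 1, 0.05)` yields 20 quantile levels, and the odd indices 1, 3, ..., 19 pick out exactly the 5th, 15th, ..., 95th percentiles stored in the info_city columns:

```python
import numpy as np

distances = np.linspace(0, 10, 101)  # synthetic distances in km
pc = np.quantile(distances, np.arange(0, 1, 0.05))  # 20 levels: 0.00, 0.05, ..., 0.95
odd_percentiles = pc[1::2]  # the 5th, 15th, ..., 95th percentiles
print(len(pc), odd_percentiles[0], odd_percentiles[-1])  # 20 0.5 9.5
```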
info_city = pd.DataFrame(columns=['5', '15', '25', '35', '45', '55', '65', '75', '85', '95',
'tot_points', 'min', 'max', 'avg', 'centroid_lat', 'centroid_lon'], index=centroids.index)
info_city.head(2)
if LOAD_DATA_FROM_CHECKPOINT: # load data
    info_city = load_checkpoint('checkpoint_cities')
else: # compute data
    percentile_cols = ['5', '15', '25', '35', '45', '55', '65', '75', '85', '95']
    for state, county, city in centroids.index:
        # points of this city with both coordinates available
        city_points = incidents_df.loc[
            (incidents_df['city'] == city) &
            (incidents_df['state'] == state) &
            (incidents_df['county'] == county) &
            incidents_df['latitude'].notna() &
            incidents_df['longitude'].notna()]
        distances = sorted(
            geopy_distance.geodesic([lat, long], centroids.loc[state, county, city]).km
            for lat, long in zip(city_points['latitude'], city_points['longitude']))
        pc = np.quantile(distances, np.arange(0, 1, 0.05))
        idx = (state, county, city)
        for i, col in enumerate(percentile_cols):
            info_city.loc[idx, col] = pc[i*2 + 1] # 5th, 15th, ..., 95th percentile
        info_city.loc[idx, 'tot_points'] = len(distances)
        info_city.loc[idx, 'min'] = min(distances)
        info_city.loc[idx, 'max'] = max(distances)
        info_city.loc[idx, 'avg'] = sum(distances) / len(distances)
        info_city.loc[idx, 'centroid_lat'] = centroids.loc[idx]['latitude']
        info_city.loc[idx, 'centroid_lon'] = centroids.loc[idx]['longitude']
    save_checkpoint(info_city, 'checkpoint_cities') # save data
We display the resulting dataframe and plot it on a map:
info_city.head()
| county | city | 5 | 15 | 25 | 35 | 45 | 55 | 65 | 75 | 85 | 95 | tot_points | min | max | avg | centroid_lat | centroid_lon | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| state | ||||||||||||||||||
| ALABAMA | Autauga County | Autaugaville | 0.274213 | 0.289707 | 0.305201 | 0.320695 | 0.336188 | 0.370072 | 0.422346 | 0.474619 | 0.526893 | 0.579166 | 3 | 0.266466 | 0.605303 | 0.405235 | 32.432700 | -86.656433 |
| ALABAMA | Autauga County | Prattville | 0.352777 | 1.233816 | 1.487384 | 2.595130 | 2.728048 | 3.498539 | 3.823870 | 3.912885 | 4.881287 | 6.568004 | 27 | 0.325572 | 8.573536 | 3.135990 | 32.465774 | -86.456941 |
| ALABAMA | Baldwin County | Bay Minette | 0.649590 | 0.851911 | 0.851911 | 0.858401 | 0.998652 | 1.141183 | 1.274665 | 1.507917 | 2.596292 | 5.745650 | 9 | 0.514709 | 7.663826 | 1.936234 | 30.870900 | -87.787700 |
| ALABAMA | Baldwin County | Daphne | 0.881853 | 1.510370 | 2.681879 | 2.700294 | 2.746789 | 3.010000 | 3.450319 | 4.034984 | 4.257822 | 5.261581 | 10 | 0.881853 | 6.059492 | 3.096890 | 30.624720 | -87.884390 |
| ALABAMA | Baldwin County | Elberta | 0.320503 | 0.641146 | 0.961789 | 1.282432 | 1.603075 | 1.777587 | 1.805968 | 1.834349 | 1.862730 | 1.891110 | 3 | 0.160181 | 1.905301 | 1.276293 | 30.414333 | -87.588567 |
info_city.loc[info_city['tot_points'] > 1].info()
<class 'pandas.core.frame.DataFrame'>
Index: 6868 entries, ALABAMA to WYOMING
Data columns (total 18 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   county        6868 non-null   object
 1   city          6868 non-null   object
 2   5             6868 non-null   float64
 3   15            6868 non-null   float64
 4   25            6868 non-null   float64
 5   35            6868 non-null   float64
 6   45            6868 non-null   float64
 7   55            6868 non-null   float64
 8   65            6868 non-null   float64
 9   75            6868 non-null   float64
 10  85            6868 non-null   float64
 11  95            6868 non-null   float64
 12  tot_points    6868 non-null   int64
 13  min           6868 non-null   float64
 14  max           6868 non-null   float64
 15  avg           6868 non-null   float64
 16  centroid_lat  6868 non-null   float64
 17  centroid_lon  6868 non-null   float64
dtypes: float64(15), int64(1), object(2)
memory usage: 1019.5+ KB
fig = plot_scattermap_plotly(
info_city,
size='tot_points',
x_column='centroid_lat',
y_column='centroid_lon',
hover_name=False, zoom=2,
title='Number of incidents per city'
)
fig.show()
pyo.plot(fig, filename='../html/incidents_per_city.html', auto_open=False);
We utilize the previously calculated data to infer missing values for the city field in entries of the dataset where latitude and longitude are available. The city field is assigned if the distance of the entry from the centroid falls within the third quartile of all points assigned to that centroid.
def substitute_city(row, info_city):
if pd.isna(row['city']) and not np.isnan(row['latitude']):
for state, county, city in info_city.index:
if row['state'] == state and row['county'] == county:
if info_city.loc[state, county, city]['tot_points'] > 1:
                    max_radius = info_city.loc[state, county, city]['75'] # third quartile
centroid_coord = [info_city.loc[state, county, city]['centroid_lat'],
info_city.loc[state, county, city]['centroid_lon']]
if (geopy_distance.geodesic([row['latitude'], row['longitude']], centroid_coord).km <=
max_radius):
row['city'] = city
break
return row
if LOAD_DATA_FROM_CHECKPOINT:
new_incidents_df = load_checkpoint('checkpoint_2', date_cols=['date', 'date_original'])
else:
new_incidents_df = incidents_df.apply(lambda row: substitute_city(row, info_city), axis=1)
    save_checkpoint(new_incidents_df, 'checkpoint_2')
n_nan_cities = incidents_df['city'].isnull().sum()
n_nan_cities_after = new_incidents_df['city'].isnull().sum()
print('Number of rows with null values for city before the inference: ', n_nan_cities)
print('Number of rows with null values for city after the inference: ', n_nan_cities_after)
print(f'Number of cities recovered: {n_nan_cities - n_nan_cities_after}')
Number of rows with null values for city before the inference: 35794
Number of rows with null values for city after the inference: 33545
Number of cities recovered: 2249
incidents_df_nan_city = new_incidents_df.loc[(incidents_df['latitude'].notna()) & (incidents_df['county'].notna()) & (incidents_df['city'].isna())]
print('Number of rows with null values for city, but not for lat/lon and county: ', len(incidents_df_nan_city))
plot_scattermap_plotly(incidents_df_nan_city, 'city', zoom=2, title='City inferred')
Number of rows with null values for city, but not for lat/lon and county: 22995
incidents_df = new_incidents_df
incidents_df[['latitude', 'county', 'city']].isna().groupby(['latitude', 'county', 'city']).size().to_frame().rename(columns={0:'count'})
| count | |||
|---|---|---|---|
| latitude | county | city | |
| False | False | False | 195959 |
| True | 20746 | ||
| True | False | 9877 | |
| True | 3 | ||
| True | False | True | 1762 |
| True | True | 11034 |
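The pattern used above — grouping the frame of `isna()` flags by its own columns — can be verified on a toy frame (the data are made up for illustration; each group key is a missing-value pattern and the size is the number of rows showing it):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'latitude': [1.0, np.nan, 2.0, np.nan],
    'county':   ['A', np.nan, np.nan, 'B'],
    'city':     ['x', np.nan, np.nan, np.nan],
})
# group the boolean isna() frame by its own three columns
pattern_counts = toy.isna().groupby(['latitude', 'county', 'city']).size()
print(pattern_counts)
```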
We check if the attribute congressional_district is numbered consistently (with '0' for states with only one congressional district). To do so, we use the dataset containing the data about elections in the period of interest (congressional districts are redrawn after each decennial census, i.e. when (year%10)==0):
at_large_states = elections_df[
(elections_df['year'].between(2013, 2018, inclusive="both")) &
(elections_df['congressional_district']==0)
]['state'].unique()
at_large_states
array(['ALASKA', 'DELAWARE', 'DISTRICT OF COLUMBIA', 'MONTANA',
'NORTH DAKOTA', 'SOUTH DAKOTA', 'VERMONT', 'WYOMING'], dtype=object)
Now we check if states with a '0' as congressional district are the same states with only one congressional district in the dataset containing the data about elections:
zero_congress_states_inc = incidents_df[incidents_df['congressional_district']==0]['state'].unique()
set(zero_congress_states_inc).issubset(set(at_large_states))
True
We check if states with a single congressional district always have '0' in the incidents dataset:
incidents_df[(incidents_df['state'].isin(at_large_states)) & (incidents_df['congressional_district']!=0)].size==0
False
Since they are not, we fix this issue:
incidents_df.loc[incidents_df['state'].isin(at_large_states), 'congressional_district'] = 0
We check if the range of the attribute congressional_district is consistent with the number of congressional districts in the dataset containing the data about elections:
incidents_df['state'] = incidents_df['state'].str.upper()
wrong_congr_states = elections_df.groupby('state')['congressional_district'].max()>=incidents_df.groupby('state')['congressional_district'].max()
for state in wrong_congr_states[wrong_congr_states==False].index:
print(f"State {state} has more districts in the incidents data than in the elections data")
State KENTUCKY has more districts in the incidents data than in the elections data
State OREGON has more districts in the incidents data than in the elections data
State WEST VIRGINIA has more districts in the incidents data than in the elections data
We display the rows with inconsistent congressional district in Kentucky:
incidents_df[
(incidents_df['state']=='KENTUCKY') &
(incidents_df['congressional_district'] >
elections_df[(elections_df['state']=='KENTUCKY') & (elections_df['year']>2012)]['congressional_district'].max())
]
| date | state | city_or_county | address | latitude | longitude | congressional_district | state_house_district | state_senate_district | participant_age1 | participant_age_group1 | participant_gender1 | min_age_participants | avg_age_participants | max_age_participants | n_participants_child | n_participants_teen | n_participants_adult | n_males | n_females | n_killed | n_injured | n_arrested | n_unharmed | n_participants | notes | incident_characteristics1 | incident_characteristics2 | year | date_original | month | month_name | day | day_of_week | day_of_week_name | county | city | state_consistency | county_consistency | address_consistency | location_importance | address_type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 69605 | 2014-07-30 | KENTUCKY | Lexington | Woodland Avenue near Main Street | NaN | NaN | 8.0 | 79.0 | 10.0 | 23.0 | Adult 18+ | Male | 23.0 | 23.0 | 23.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 0 | 0 | 0.0 | 2.0 | 2.0 | man robbed at gunpoint | Armed robbery with injury/death and/or evidence of DGU found | NaN | 2014.0 | 2014-07-30 | 7 | July | 30 | 2 | Wednesday | NaN | NaN | -1 | -1 | 1 | NaN | NaN |
Searching online, we found that Kentucky had 6 congressional districts in that period, so we set the congressional district of the row above to NaN:
incidents_df.loc[
(incidents_df['state']=='KENTUCKY') &
(incidents_df['congressional_district'] >
elections_df[(elections_df['state']=='KENTUCKY') & (elections_df['year']>2012)]['congressional_district'].max()),
'congressional_district'] = np.nan
We display the rows with inconsistent congressional district in Oregon:
incidents_df[
(incidents_df['state']=='OREGON') &
(incidents_df['congressional_district'] >
elections_df[(elections_df['state']=='OREGON') & (elections_df['year']>2012)]['congressional_district'].max())
]
| date | state | city_or_county | address | latitude | longitude | congressional_district | state_house_district | state_senate_district | participant_age1 | participant_age_group1 | participant_gender1 | min_age_participants | avg_age_participants | max_age_participants | n_participants_child | n_participants_teen | n_participants_adult | n_males | n_females | n_killed | n_injured | n_arrested | n_unharmed | n_participants | notes | incident_characteristics1 | incident_characteristics2 | year | date_original | month | month_name | day | day_of_week | day_of_week_name | county | city | state_consistency | county_consistency | address_consistency | location_importance | address_type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 205517 | 2015-06-24 | OREGON | Dayton | NaN | NaN | NaN | 10.0 | 39.0 | 5.0 | 43.0 | Adult 18+ | Male | 43.0 | 43.0 | 43.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0 | 0 | 1.0 | 0.0 | 1.0 | NaN | Non-Shooting Incident | Drug involvement | 2015.0 | 2015-06-24 | 6 | June | 24 | 2 | Wednesday | NaN | NaN | -1 | 1 | 0 | NaN | NaN |
| 216127 | 2014-02-06 | OREGON | Marysville | NaN | NaN | NaN | 9.0 | 66.0 | 45.0 | 27.0 | Adult 18+ | Male | 27.0 | 27.0 | 27.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0 | 0 | 0.0 | 1.0 | 1.0 | NaN | Armed robbery with injury/death and/or evidence of DGU found | NaN | 2014.0 | 2014-02-06 | 2 | February | 6 | 3 | Thursday | NaN | NaN | -1 | 1 | 0 | NaN | NaN |
Searching online, we found that Oregon had 5 congressional districts in that period, so we set the congressional district of the rows above to NaN:
incidents_df.loc[
(incidents_df['state']=='OREGON') &
(incidents_df['congressional_district'] >
elections_df[(elections_df['state']=='OREGON') & (elections_df['year']>2012)]['congressional_district'].max()),
'congressional_district'] = np.nan
We display the rows with inconsistent congressional district in West Virginia:
incidents_df[
(incidents_df['state']=='WEST VIRGINIA') &
(incidents_df['congressional_district'] >
elections_df[(elections_df['state']=='WEST VIRGINIA') & (elections_df['year']>2012)]['congressional_district'].max())
]
| date | state | city_or_county | address | latitude | longitude | congressional_district | state_house_district | state_senate_district | participant_age1 | participant_age_group1 | participant_gender1 | min_age_participants | avg_age_participants | max_age_participants | n_participants_child | n_participants_teen | n_participants_adult | n_males | n_females | n_killed | n_injured | n_arrested | n_unharmed | n_participants | notes | incident_characteristics1 | incident_characteristics2 | year | date_original | month | month_name | day | day_of_week | day_of_week_name | county | city | state_consistency | county_consistency | address_consistency | location_importance | address_type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 221378 | 2017-05-30 | WEST VIRGINIA | Hartford | NaN | NaN | NaN | 6.0 | 94.0 | 30.0 | 52.0 | Adult 18+ | Male | 32.0 | 42.0 | 52.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 0 | 1 | 0.0 | 1.0 | 2.0 | Clay Co, father shot son with same name, son threat with shotgun, father shot with 22 cal revolver | Shot - Wounded/Injured | Defensive Use | 2017.0 | 2017-05-30 | 5 | May | 30 | 1 | Tuesday | NaN | NaN | -1 | -1 | 0 | NaN | NaN |
Searching online, we found that West Virginia had 3 congressional districts in that period, so we set the congressional district of the row above to NaN:
incidents_df.loc[
(incidents_df['state']=='WEST VIRGINIA') &
(incidents_df['congressional_district'] >
elections_df[(elections_df['state']=='WEST VIRGINIA') & (elections_df['year']>2012)]['congressional_district'].max()),
'congressional_district'] = np.nan
We check whether given a certain value for the attributes latitude and longitude, the attribute congressional_district has always the same value:
incidents_df[incidents_df['congressional_district'].notnull()].groupby(['latitude', 'longitude'])['congressional_district'].unique()[lambda x: x.str.len() > 1].to_frame().sample(5, random_state=1)
| congressional_district | ||
|---|---|---|
| latitude | longitude | |
| 36.6732 | -76.9284 | [3.0, 4.0] |
| 37.5478 | -77.4219 | [4.0, 3.0] |
| 30.4153 | -84.2972 | [2.0, 5.0] |
| 37.0438 | -76.3586 | [2.0, 3.0] |
| 36.1162 | -79.4353 | [4.0, 6.0] |
These conflicting values are probably errors due to points lying near the border between two congressional districts. We correct them by setting the most frequent value for the attribute congressional_district (assigning that value also to the entries with missing values):
corrected_congr_districts = incidents_df[
~incidents_df['congressional_district'].isna()
].groupby(['latitude', 'longitude'])['congressional_district'].agg(lambda x: x.value_counts().index[0])
incidents_df = incidents_df.merge(corrected_congr_districts, on=['latitude', 'longitude'], how='left')
# where latitude and longitude are null, keep the original value
incidents_df['congressional_district_y'].fillna(incidents_df['congressional_district_x'], inplace=True)
incidents_df.rename(columns={'congressional_district_y':'congressional_district'}, inplace=True)
incidents_df.drop(columns=['congressional_district_x'], inplace=True)
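The merge/fill/rename pattern above can be checked on a toy frame (column names are illustrative; the last row keeps its original value because its coordinates are missing and thus have no match in the corrected series):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'latitude':  [1.0, 1.0, 1.0, 1.0, np.nan],
    'longitude': [1.0, 1.0, 1.0, 1.0, np.nan],
    'district':  [3.0, 3.0, 4.0, np.nan, 5.0],
})
# most frequent district per coordinate pair
mode = df[df['district'].notna()].groupby(['latitude', 'longitude'])['district'].agg(
    lambda x: x.value_counts().index[0])
df = df.merge(mode, on=['latitude', 'longitude'], how='left')
df['district_y'] = df['district_y'].fillna(df['district_x'])  # keep original where no match
df = df.rename(columns={'district_y': 'district'}).drop(columns=['district_x'])
print(df['district'].tolist())  # [3.0, 3.0, 3.0, 3.0, 5.0]
```

Note that the row that originally held 4.0 is overwritten with the mode 3.0, exactly as in the correction above.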
Within the same city or county there can be different values for the attribute congressional_district (this is not an error; it is actually possible under US law):
incidents_df[incidents_df['congressional_district'].notna()].groupby(['state', 'city_or_county'])['congressional_district'].unique()[lambda x: x.str.len() > 1].to_frame()
| congressional_district | ||
|---|---|---|
| state | city_or_county | |
| ALABAMA | Adamsville | [6.0, 7.0] |
| Auburn | [3.0, 1.0] | |
| Bessemer | [7.0, 6.0] | |
| Bessemer (Hueytown) | [7.0, 6.0] | |
| Birmingham | [7.0, 6.0, 4.0] | |
| ... | ... | ... |
| WISCONSIN | Nekoosa | [7.0, 3.0] |
| Platteville | [2.0, 3.0] | |
| Richfield | [7.0, 5.0] | |
| Rock (county) | [1.0, 2.0] | |
| Wisconsin Dells | [6.0, 3.0, 2.0] |
1598 rows × 1 columns
We print the unique values the attribute state_house_district can take on:
house_districts = incidents_df['state_house_district'].unique()
house_districts.sort()
house_districts
array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11.,
12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22.,
23., 24., 25., 26., 27., 28., 29., 30., 31., 32., 33.,
34., 35., 36., 37., 38., 39., 40., 41., 42., 43., 44.,
45., 46., 47., 48., 49., 50., 51., 52., 53., 54., 55.,
56., 57., 58., 59., 60., 61., 62., 63., 64., 65., 66.,
67., 68., 69., 70., 71., 72., 73., 74., 75., 76., 77.,
78., 79., 80., 81., 82., 83., 84., 85., 86., 87., 88.,
89., 90., 91., 92., 93., 94., 95., 96., 97., 98., 99.,
100., 101., 102., 103., 104., 105., 106., 107., 108., 109., 110.,
111., 112., 113., 114., 115., 116., 117., 118., 119., 120., 121.,
122., 123., 124., 125., 126., 127., 128., 129., 130., 131., 132.,
133., 134., 135., 136., 137., 138., 139., 140., 141., 142., 143.,
144., 145., 146., 147., 148., 149., 150., 151., 152., 153., 154.,
155., 156., 157., 158., 159., 160., 161., 162., 163., 164., 165.,
166., 167., 168., 169., 170., 171., 172., 173., 174., 175., 176.,
177., 178., 179., 180., 181., 182., 183., 184., 185., 186., 187.,
188., 189., 190., 191., 192., 193., 194., 195., 196., 197., 198.,
199., 200., 201., 202., 203., 204., 205., 206., 207., 208., 209.,
210., 211., 212., 214., 215., 216., 217., 218., 219., 303., 411.,
413., 503., 506., 510., 511., 512., 513., 514., 515., 516., 517.,
518., 521., 524., 529., 530., 533., 537., 601., 602., 603., 605.,
608., 610., 614., 615., 618., 622., 623., 624., 705., 708., 709.,
713., 714., 716., 717., 718., 720., 721., 723., 724., 725., 726.,
727., 729., 801., 804., 805., 808., 809., 811., 813., 814., 901.,
nan])
This attribute also has some errors, because the maximum number of state house districts should be 204 (for New Hampshire, see here). For now we won't correct these errors because this attribute is not useful for our analysis.
We check if given a certain value for the attributes latitude and longitude, the attribute state_house_district has always the same value:
incidents_df[incidents_df['state_house_district'].notnull()].groupby(
['latitude', 'longitude'])['state_house_district'].unique()[lambda x: x.str.len() > 1].to_frame()
| state_house_district | ||
|---|---|---|
| latitude | longitude | |
| 25.7789 | -80.1980 | [113.0, 109.0] |
| 27.7805 | -97.3983 | [34.0, 32.0] |
| 29.6807 | -95.4252 | [134.0, 146.0] |
| 29.7364 | -95.5887 | [137.0, 133.0] |
| 29.9462 | -90.1198 | [91.0, 98.0] |
| ... | ... | ... |
| 44.8293 | -69.7514 | [111.0, 86.0] |
| 45.5044 | -122.5030 | [51.0, 47.0] |
| 45.5263 | -122.5680 | [45.0, 46.0] |
| 45.5740 | -122.6380 | [43.0, 44.0] |
| 61.2228 | -149.7920 | [17.0, 14.0] |
174 rows × 1 columns
We correct the errors:
corrected_house_districts = incidents_df[
incidents_df['state_house_district'].notnull()
].groupby(['latitude', 'longitude'])['state_house_district'].agg(lambda x: x.value_counts().index[0])
incidents_df = incidents_df.merge(corrected_house_districts, on=['latitude', 'longitude'], how='left')
incidents_df['state_house_district_y'].fillna(incidents_df['state_house_district_x'], inplace=True)
incidents_df.rename(columns={'state_house_district_y':'state_house_district'}, inplace=True)
incidents_df.drop(columns=['state_house_district_x'], inplace=True)
We now print the unique values the attribute state_senate_district can take on:
senate_districts = incidents_df['state_senate_district'].unique()
senate_districts.sort()
senate_districts
array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13.,
14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25., 26.,
27., 28., 29., 30., 31., 32., 33., 34., 35., 36., 37., 38., 39.,
40., 41., 42., 43., 44., 45., 46., 47., 48., 49., 50., 51., 52.,
53., 54., 55., 56., 57., 58., 59., 60., 61., 62., 63., 64., 65.,
66., 67., 94., nan])
Again we notice some errors, because the maximum number of state senate districts should be 67 (for Minnesota, see here). For now we won't correct these errors because this attribute is not useful for our analysis.
We correct other possible errors as above:
corrected_senate_districts = incidents_df[
incidents_df['state_senate_district'].notnull()
].groupby(['latitude', 'longitude'])['state_senate_district'].agg(lambda x: x.value_counts().index[0])
incidents_df = incidents_df.merge(corrected_senate_districts, on=['latitude', 'longitude'], how='left')
incidents_df['state_senate_district_y'].fillna(incidents_df['state_senate_district_x'], inplace=True)
incidents_df.rename(columns={'state_senate_district_y':'state_senate_district'}, inplace=True)
incidents_df.drop(columns=['state_senate_district_x'], inplace=True)
We check whether given a state, city_or_county and state_senate_district, the value of the attribute congressional_district is always the same:
incidents_df[incidents_df['congressional_district'].notnull()].groupby(
['state', 'city_or_county', 'state_senate_district'])['congressional_district'].unique()[lambda x: x.str.len() > 1].shape[0]==0
False
Hence we cannot recover the missing values for the attribute congressional_district from the values of state_senate_district. We check the same for the attribute state_house_district:
incidents_df[incidents_df['congressional_district'].notnull()].groupby(
['state', 'city_or_county', 'state_house_district'])['congressional_district'].unique()[lambda x: x.str.len() > 1].shape[0]==0
False
We cannot recover the missing values for the attribute congressional_district from the values of state_house_district either.
We could, instead, recover the missing values from the entries with "similar" latitude and longitude. To explore this possibility we first plot on a map the dislocation of the incidents, coloring them according to the value of the attribute congressional_district:
plot_scattermap_plotly(
incidents_df,
'congressional_district',
black_nan=True,
zoom=2,
height=800,
width=800,
title="USA Congressional districts"
)
Points with a missing congressional_district are often "surrounded" by points belonging to the same congressional district. We can therefore use a KNN classifier to recover those values.
We'll do this first for the state of Alabama, showing the results with some plots. Later we will do the same for all the other states.
We plot the distribution of the values of the attribute congressional_district for the state of Alabama:
plot_scattermap_plotly(
incidents_df[incidents_df['state']=='ALABAMA'],
attribute='congressional_district',
black_nan=True,
width=500,
height=600,
zoom=5.5,
title="Alabama incidents by Congressional Districts",
legend_title="Congressional District"
)
We define a function to prepare the data for the classification task:
def build_X_y_for_district_inference(incidents_df):
    has_coords = incidents_df['latitude'].notna() & incidents_df['longitude'].notna()
    has_district = incidents_df['congressional_district'].notna()
    # train on points with known district, predict for points with coordinates only
    train = incidents_df[has_district & has_coords]
    test = incidents_df[~has_district & has_coords]
    X_train = train[['latitude', 'longitude']].values
    X_test = test[['latitude', 'longitude']].values
    y_train = train['congressional_district'].values
    return X_train, X_test, y_train
We define the function to compute the geodesic distance to pass to the KNN classifier:
def geodesic_distance(point1, point2):
return geopy_distance.geodesic(point1, point2).km
Now we are ready to apply the classifier (using K=1):
X_train, X_test, y_train = build_X_y_for_district_inference(incidents_df[incidents_df['state']=="ALABAMA"])
knn_clf = KNeighborsClassifier(n_neighbors=1, metric=geodesic_distance)
knn_clf.fit(X_train, y_train)
knn_pred = knn_clf.predict(X_test)
incidents_df['KNN_congressional_district'] = incidents_df['congressional_district']
incidents_df.loc[
(incidents_df['state']=="ALABAMA") &
(incidents_df['congressional_district'].isna()) &
(incidents_df['latitude'].notna()) &
(incidents_df['longitude'].notna()),
'KNN_congressional_district'
] = knn_pred
We plot the results:
plot_scattermap_plotly(
incidents_df[incidents_df['state']=='ALABAMA'],
attribute='KNN_congressional_district',
width=500,
height=600,
zoom=5.5,
title="Alabama incidents by Congressional Districts",
legend_title="Congressional District"
)
To improve the visualization, we plot on the map the decision boundaries of the classifier. To do so, we convert latitude and longitude to a 2D space:
transformer = Transformer.from_crs("EPSG:4326", "EPSG:26929", always_xy=True) # EPSG:26929 identifies the projected coordinate system for Alabama East (had to choose between E,W,N,S)
X_train_converted = []
for i in range(X_train.shape[0]):
x, y = transformer.transform(X_train[i][1], X_train[i][0])
X_train_converted.append([x,y])
X_train_converted = np.array(X_train_converted)
And now we train the classifier using the euclidean distance:
knn_eu_clf = KNeighborsClassifier(n_neighbors=1, metric='euclidean')
knn_eu_clf.fit(X_train_converted, y_train)
KNeighborsClassifier(metric='euclidean', n_neighbors=1)
We plot the boundaries of the classifier:
alabama_color_map = {
1:'red',
2:'orange',
3:'yellow',
4:'green',
5:'lightblue',
6:'blue',
7:'purple'
}
plot_clf_decision_boundary(knn_eu_clf, X_train_converted, y_train, alabama_color_map, "KNN Alabama borders")
We can now compare the boundaries built by the classifier with the actual district boundaries (the reference map is not included in this export).
The result is satisfactory. However, it is important to highlight that if no examples are available for a specific district, we cannot assign the correct label to the points in that district. We check how many congressional districts have 2 or fewer examples:
incidents_df.groupby(['state', 'congressional_district']).size()[lambda x: x <= 2]
state congressional_district NEW JERSEY 14.0 1 dtype: int64
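The `[lambda x: x <= 2]` indexing above filters a groupby(...).size() Series by value. On a toy frame (column values hypothetical) the pattern works like this:

```python
import pandas as pd

toy = pd.DataFrame({
    'state': ['A', 'A', 'A', 'B', 'C'],
    'district': [1, 1, 1, 1, 1],
})

# size() counts rows per (state, district) pair;
# the lambda keeps only the sparse groups
sizes = toy.groupby(['state', 'district']).size()
sparse = sizes[lambda x: x <= 2]
```

Here sparse retains only the (state, district) pairs with at most two rows, exactly as in the query on the real dataset.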
Misclassification can still occur, depending on the position of the available examples w.r.t. the position of the points to classify. Aware of this limitation, we proceed to apply this method to the other states and plot the result:
if LOAD_DATA_FROM_CHECKPOINT:
incidents_df = load_checkpoint('checkpoint_3', date_cols=['date', 'date_original'])
else:
for state in incidents_df['state'].unique():
    if state != "ALABAMA":
        X_train, X_test, y_train = build_X_y_for_district_inference(incidents_df[incidents_df['state']==state])
        if X_test.shape[0] == 0:
            continue
        knn_clf.fit(X_train, y_train)
        knn_pred = knn_clf.predict(X_test)
        incidents_df.loc[
            (incidents_df['state']==state) &
            (incidents_df['congressional_district'].isna()) &
            (incidents_df['latitude'].notna()) &
            (incidents_df['longitude'].notna()),
            'KNN_congressional_district'
        ] = knn_pred
        print(f"{state} done.")
incidents_df.drop(columns=['congressional_district'], inplace=True)
incidents_df.rename(columns={'KNN_congressional_district':'congressional_district'}, inplace=True)
save_checkpoint(incidents_df, 'checkpoint_3')
plot_scattermap_plotly(
incidents_df,
'congressional_district',
zoom=2,
height=800,
width=800,
title="USA Congressional districts (after inference)"
)
We now plot on a map the location of the incidents, coloring them according to the values of the attributes state_senate_district and state_house_district, to assess whether we can apply the same method to recover missing values:
plot_scattermap_plotly(
incidents_df,
'state_senate_district',
black_nan=True,
zoom=2,
height=800,
width=800,
title="USA State senate districts"
)
plot_scattermap_plotly(
incidents_df,
'state_house_district',
black_nan=True,
zoom=2,
height=800,
width=800,
title="USA State house districts"
)
These attributes have a lot of missing values, sometimes spread over large areas where there are no other points. Given this scarcity of training examples, we cannot apply the same method to recover the missing values.
Finally we visualize the most frequent addresses:
incidents_df.groupby(['address']).size().sort_values(ascending=False)[:50].plot(
kind='bar',
figsize=(10,6),
title='Counts of the addresses with the 50 highest number of incidents'
);
Many of the most frequent addresses are located in airports.
We display a concise summary of the attributes related to age, gender and number of participants:
participants_columns = ['participant_age1', 'participant_age_group1', 'participant_gender1',
'min_age_participants', 'avg_age_participants', 'max_age_participants',
'n_participants_child', 'n_participants_teen', 'n_participants_adult',
'n_males', 'n_females',
'n_killed', 'n_injured', 'n_arrested', 'n_unharmed',
'n_participants']
age_df = incidents_df[participants_columns]
age_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 239381 entries, 0 to 239380 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 participant_age1 147360 non-null float64 1 participant_age_group1 197508 non-null object 2 participant_gender1 203266 non-null object 3 min_age_participants 159107 non-null float64 4 avg_age_participants 159149 non-null float64 5 max_age_participants 159065 non-null float64 6 n_participants_child 197518 non-null float64 7 n_participants_teen 197521 non-null float64 8 n_participants_adult 197522 non-null float64 9 n_males 203266 non-null float64 10 n_females 203266 non-null float64 11 n_killed 239381 non-null int64 12 n_injured 239381 non-null int64 13 n_arrested 211988 non-null float64 14 n_unharmed 211988 non-null float64 15 n_participants 239381 non-null float64 dtypes: float64(12), int64(2), object(2) memory usage: 31.0+ MB
age_df['participant_age_group1'].unique()
array(['Adult 18+', nan, 'Teen 12-17', 'Child 0-11'], dtype=object)
We check if we have entries with non-null values for participant_age1 but NaN for participant_age_group1.
age_df[age_df['participant_age1'].notna() & age_df['participant_age_group1'].isna()]
| participant_age1 | participant_age_group1 | participant_gender1 | min_age_participants | avg_age_participants | max_age_participants | n_participants_child | n_participants_teen | n_participants_adult | n_males | n_females | n_killed | n_injured | n_arrested | n_unharmed | n_participants | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1279 | 29.0 | NaN | Male | 29.0 | 29.0 | 29.0 | NaN | NaN | NaN | 1.0 | 0.0 | 0 | 0 | NaN | NaN | 2.0 |
| 3251 | 34.0 | NaN | Male | 34.0 | 34.0 | 34.0 | NaN | NaN | NaN | 1.0 | 0.0 | 0 | 0 | NaN | NaN | 1.0 |
| 3480 | 20.0 | NaN | Female | 20.0 | 20.0 | 20.0 | NaN | NaN | NaN | 0.0 | 1.0 | 0 | 1 | 0.0 | 0.0 | 1.0 |
| 4035 | 23.0 | NaN | Male | 23.0 | 23.0 | 23.0 | NaN | NaN | NaN | 1.0 | 0.0 | 0 | 1 | 0.0 | 0.0 | 1.0 |
| 6445 | 18.0 | NaN | Male | 18.0 | 18.0 | 18.0 | NaN | NaN | NaN | 1.0 | 0.0 | 0 | 1 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 229593 | 34.0 | NaN | Male | 34.0 | 44.0 | 55.0 | -252.0 | NaN | -355.0 | 2.0 | 0.0 | 1 | 0 | 1.0 | 0.0 | 2.0 |
| 233599 | 25.0 | NaN | Male | 25.0 | 25.0 | 25.0 | NaN | NaN | NaN | 1.0 | 0.0 | 0 | 0 | NaN | NaN | 1.0 |
| 235505 | 48.0 | NaN | Male | 48.0 | 48.0 | 48.0 | NaN | NaN | NaN | 1.0 | 0.0 | 0 | 1 | 0.0 | 0.0 | 1.0 |
| 236181 | 25.0 | NaN | Male | 25.0 | 25.0 | 25.0 | NaN | NaN | NaN | 1.0 | 0.0 | 0 | 1 | 0.0 | 0.0 | 1.0 |
| 238254 | 33.0 | NaN | Male | 31.0 | 32.0 | 33.0 | NaN | NaN | NaN | 1.0 | 2.0 | 0 | 0 | 0.0 | 3.0 | 4.0 |
196 rows × 16 columns
These 196 missing values of participant_age_group1 can be inferred from participant_age1.
Below, we provide a brief summary of the operations we performed to correct missing and inconsistent values.
First of all, we converted all the consistent values to integers. All the out-of-range values (e.g. negative values or improbable ages) were set to NaN. We considered the maximum possible age to be 122 years, the age reached by Jeanne Louise Calment, the world's oldest documented person.
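The out-of-range replacement described above can be sketched as follows (the column values are toy data; the real cleaning lives in our data preparation utilities):

```python
import pandas as pd

MAX_AGE = 122  # age reached by Jeanne Louise Calment

ages = pd.Series([29.0, -252.0, 34.0, 311.0, 18.0])

# keep only values inside the plausible [0, MAX_AGE] range, else NaN
clean = ages.where(ages.between(0, MAX_AGE))

# consistent values can then be stored as nullable integers
clean_int = clean.round().astype('Int64')
```

Using the nullable Int64 dtype lets us keep integer ages alongside missing values without falling back to floats.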
To identify inconsistencies in the data related to the minimum, maximum, average age of participants, and to the composition of the age groups we checked if:
n_participants_child $+$ n_participants_teen $+$ n_participants_adult $>$ 0
$if$ min_age_participants $<$ 12 $then$ n_participants_child $>$ 0
$if$ min_age_participants $\geq$ 18 $then$ n_participants_adult $>$ 0
$if$ max_age_participants $<$ 12 $then$ n_participants_child $>$ 0 and n_participants_teen $=$ 0 and n_participants_adult $=$ 0
Note that: child = 0-11, teen = 12-17, adult = 18+
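The checks just listed can be sketched on a single row as follows (field names follow the dataset; this illustrative function is not the project's actual check_age_gender_data_consistency):

```python
def age_data_consistent(row):
    """Return True if a row passes the age/age-group checks listed above."""
    n_child = row['n_participants_child']
    n_teen = row['n_participants_teen']
    n_adult = row['n_participants_adult']
    if n_child + n_teen + n_adult <= 0:
        return False
    if row['min_age_participants'] < 12 and n_child <= 0:
        return False
    if row['min_age_participants'] >= 18 and n_adult <= 0:
        return False
    # a maximum age below 12 implies all participants are children
    if row['max_age_participants'] < 12 and not (n_child > 0 and n_teen == 0 and n_adult == 0):
        return False
    return True
```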
To identify inconsistencies in the data related to the number of participants we checked if:
Values related to participant_age_group1 and participant_gender1 have been binarized using one-hot encoding, thus creating the boolean features participant1_child, participant1_teen, participant1_adult, participant1_male, participant1_female.
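The binarization of participant_age_group1 can be sketched with pd.get_dummies (toy values matching the categories seen earlier):

```python
import pandas as pd

groups = pd.Series(['Adult 18+', 'Teen 12-17', None, 'Child 0-11'])

# one column per category; rows with a missing group get all-False dummies
one_hot = pd.get_dummies(groups)
participant1_adult = one_hot['Adult 18+']
```

The same pattern applies to participant_gender1, yielding the participant1_male and participant1_female columns.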
To identify other potential inconsistencies we did the following checks:
$if$ participant_age1 $\geq$ 18 $then$ participant_age_group1 $==$ Adult
$if$ participant_age_group1 $==$ Child $then$ n_participants_child $>$ 0
$if$ participant_age_group1 $==$ Adult $then$ n_participants_adult $>$ 0
$if$ participant_gender1 $==$ Male $then$ n_males $>$ 0
We kept track of data consistency by using the following variables (the variables were set to True if data were consistent, False if they were not, or NaN when data was missing):
The chunk of code below performs the specified checks:
from TASK_1.data_preparation_utils import check_age_gender_data_consistency
if LOAD_DATA_FROM_CHECKPOINT: # load data
age_temporary_df = load_checkpoint('checkpoint_tmp')#, date_cols=['date', 'date_original'])
else: # compute data
age_temporary_df = age_df.apply(lambda row: check_age_gender_data_consistency(row), axis=1)
save_checkpoint(age_temporary_df, 'checkpoint_tmp') # save data
In the following, we will display and visualize statistics about the data consistency.
age_temporary_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 239381 entries, 0 to 239380 Data columns (total 28 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 participant_age1 147358 non-null float64 1 participant1_child 197704 non-null object 2 participant1_teen 197704 non-null object 3 participant1_adult 197704 non-null object 4 participant1_male 203265 non-null object 5 participant1_female 203265 non-null object 6 min_age_participants 147362 non-null float64 7 avg_age_participants 147358 non-null float64 8 max_age_participants 147359 non-null float64 9 n_participants_child 197514 non-null float64 10 n_participants_teen 197511 non-null float64 11 n_participants_adult 197514 non-null float64 12 n_males 203266 non-null float64 13 n_females 203266 non-null float64 14 n_killed 239381 non-null int64 15 n_injured 239381 non-null int64 16 n_arrested 211988 non-null float64 17 n_unharmed 211988 non-null float64 18 n_participants 214751 non-null float64 19 consistency_age 239381 non-null bool 20 consistency_n_participant 211988 non-null object 21 consistency_gender 203266 non-null object 22 consistency_participant1 147358 non-null object 23 consistency_participants1_wrt_n_participants 203285 non-null object 24 participant1_age_consistency_wrt_all_data 147358 non-null object 25 participant1_age_range_consistency_wrt_all_data 197704 non-null object 26 participant1_gender_consistency_wrt_all_data 203265 non-null object 27 nan_values 239381 non-null bool dtypes: bool(2), float64(12), int64(2), object(12) memory usage: 49.8+ MB
age_temporary_df[['consistency_age', 'consistency_n_participant', 'consistency_gender',
'consistency_participant1', 'consistency_participants1_wrt_n_participants',
'participant1_age_consistency_wrt_all_data', 'participant1_age_range_consistency_wrt_all_data',
'participant1_gender_consistency_wrt_all_data', 'nan_values']].describe()
| consistency_age | consistency_n_participant | consistency_gender | consistency_participant1 | consistency_participants1_wrt_n_participants | participant1_age_consistency_wrt_all_data | participant1_age_range_consistency_wrt_all_data | participant1_gender_consistency_wrt_all_data | nan_values | |
|---|---|---|---|---|---|---|---|---|---|
| count | 239381 | 211988 | 203266 | 147358 | 203285 | 147358 | 197704 | 203265 | 239381 |
| unique | 2 | 1 | 2 | 2 | 2 | 1 | 2 | 1 | 2 |
| top | True | True | True | True | True | True | True | True | False |
| freq | 234328 | 211988 | 187103 | 146259 | 203095 | 147358 | 197514 | 203265 | 145602 |
print('Number of rows with null values: ', age_temporary_df[age_temporary_df['nan_values'] == True].shape[0])
print('Number of rows with inconsistent values in age data: ', age_temporary_df[age_temporary_df['consistency_age'] == False].shape[0])
print('Number of rows with inconsistent values in number of participants data: ', age_temporary_df[age_temporary_df[
'consistency_n_participant'] == False].shape[0])
print('Number of rows with inconsistent values in gender data: ', age_temporary_df[age_temporary_df['consistency_gender'] == False].shape[0])
print('Number of rows with inconsistent values in participants1 data: ', age_temporary_df[age_temporary_df[
'consistency_participant1'] == False].shape[0])
print('Number of rows with inconsistent values for participants1: ', age_temporary_df[age_temporary_df[
'consistency_participant1'] == False].shape[0])
print('Number of rows with NaN values for participants1: ', age_temporary_df[age_temporary_df[
'consistency_participant1'] == np.nan].shape[0])
print('Number of rows with inconsistent values in participants1 wrt all other data: ', age_temporary_df[age_temporary_df[
'consistency_participants1_wrt_n_participants'] == False].shape[0])
print('Number of rows with inconsistent values in participants1 wrt age data: ', age_temporary_df[age_temporary_df[
'participant1_age_consistency_wrt_all_data'] == False].shape[0])
print('Number of rows with inconsistent values in participants1 wrt age range data: ', age_temporary_df[age_temporary_df[
'participant1_age_range_consistency_wrt_all_data'] == False].shape[0])
print('Number of rows with inconsistent values in participants1 wrt gender data: ', age_temporary_df[age_temporary_df[
'participant1_gender_consistency_wrt_all_data'] == False].shape[0])
print('Number of rows with null values in age data: ', age_temporary_df[age_temporary_df['consistency_age'].isna()].shape[0])
print('Number of rows with null values in number of participants data: ', age_temporary_df[age_temporary_df[
'consistency_n_participant'].isna()].shape[0])
print('Number of rows with null values in gender data: ', age_temporary_df[age_temporary_df['consistency_gender'].isna()].shape[0])
print('Number of rows with null values in participants1 data: ', age_temporary_df[age_temporary_df[
'consistency_participant1'].isna()].shape[0])
print('Number of rows with all null data: ', age_temporary_df.isnull().all(axis=1).sum())
Number of rows with null values: 93779 Number of rows with inconsistent values in age data: 5053 Number of rows with inconsistent values in number of participants data: 0 Number of rows with inconsistent values in gender data: 16163 Number of rows with inconsistent values in participants1 data: 1099 Number of rows with inconsistent values for participants1: 1099 Number of rows with NaN values for participants1: 0 Number of rows with inconsistent values in participants1 wrt all other data: 190 Number of rows with inconsistent values in participants1 wrt age data: 0 Number of rows with inconsistent values in participants1 wrt age range data: 190 Number of rows with inconsistent values in participants1 wrt gender data: 0 Number of rows with null values in age data: 0 Number of rows with null values in number of participants data: 27393 Number of rows with null values in gender data: 36115 Number of rows with null values in participants1 data: 92023 Number of rows with all null data: 0
We notice that:
We plot the age distribution of participant1 and compare it to the distribution of the minimum and maximum participants' age for each group:
fig, (ax0, ax1, ax2) = plt.subplots(1, 3, figsize=(20, 6), sharey=True)
ax0.hist(age_temporary_df['participant_age1'], bins=100, edgecolor='black', linewidth=0.8)
ax0.set_xlabel('Age')
ax0.set_ylabel('Frequency')
ax0.set_title('Distribution of age participant1')
ax1.hist(age_temporary_df['min_age_participants'], bins=100, edgecolor='black', linewidth=0.8)
ax1.set_xlabel('Age')
ax1.set_ylabel('Frequency')
ax1.set_title('Distribution of min age participants')
ax2.hist(age_temporary_df['max_age_participants'], bins=100, edgecolor='black', linewidth=0.8)
ax2.set_xlabel('Age')
ax2.set_ylabel('Frequency')
ax2.set_title('Distribution of max age participants')
plt.show()
The similar shapes of the distributions confirm that the data pertaining to participant1 is accurate and reliable. Therefore, we can confidently use participant1's data to fill gaps in incidents that involved a single participant.
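Filling the single-participant gaps can be sketched as follows on toy data (column names follow the dataset; the project's actual imputation logic lives in set_gender_age_consistent_data):

```python
import pandas as pd

df = pd.DataFrame({
    'n_participants': [1, 1, 2],
    'participant_age1': [24.0, 31.0, 40.0],
    'min_age_participants': [None, 31.0, None],
})

# when exactly one participant is involved, participant1's age is also
# the minimum (and maximum, and average) age of the incident
single = df['n_participants'] == 1
df.loc[single, 'min_age_participants'] = df.loc[single, 'min_age_participants'].fillna(
    df.loc[single, 'participant_age1'])
```

Rows with more than one participant are left untouched, since participant1 alone does not determine the incident-level aggregates there.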
We visualize the number of unique values for the cardinality of participants in each incident and provide a brief summary of this feature:
print('Values of n_participants: ', age_temporary_df['n_participants'].unique())
age_temporary_df['n_participants'].describe().to_frame()
Values of n_participants: [ 1. 2. 5. 3. 4. nan 6. 7. 8. 10. 9. 12. 17. 14. 11. 28. 24. 15. 37. 16. 19. 13. 35. 18. 29. 20. 21. 32. 27. 52. 22. 26. 47. 30. 23. 63. 103.]
| n_participants | |
|---|---|
| count | 214751.000000 |
| mean | 1.826520 |
| std | 1.185155 |
| min | 1.000000 |
| 25% | 1.000000 |
| 50% | 2.000000 |
| 75% | 2.000000 |
| max | 103.000000 |
We visualize the distribution of the number of participants for each incident using a log scale:
plt.figure(figsize=(20, 5))
plt.bar(incidents_df.groupby('n_participants')['n_participants'].count().index, incidents_df.groupby('n_participants')['n_participants'].count(),
alpha=0.8, edgecolor='black', linewidth=0.8)
plt.yscale('log')
plt.xlabel('Number of participants for incidents')
plt.ylabel('Number of incidents')
plt.plot([0.5, 103.5], [1, 1], '--', color='magenta', label='1 incident')
plt.plot([0.5, 103.5], [2, 2], '--', color='red', label='2 incidents')
plt.plot([0.5, 103.5], [10, 10], '--', color='green', label='10 incidents')
plt.plot([0.5, 103.5], [100, 100], '--', color='blue', label='100 incidents')
plt.xticks(range(1, 104, 2), range(1, 104, 2))
plt.legend()
plt.show()
We will now correct inconsistencies in the following manner:
from TASK_1.data_preparation_utils import set_gender_age_consistent_data
if LOAD_DATA_FROM_CHECKPOINT:
with zipfile.ZipFile('checkpoints/checkpoint_4.csv.zip', 'r') as zip_ref:
zip_ref.extractall('checkpoints/')
incidents_df = load_checkpoint('checkpoint_4', date_cols=['date', 'date_original'])
else:
new_age_df = age_temporary_df.apply(lambda row: set_gender_age_consistent_data(row), axis=1)
incidents_df[new_age_df.columns] = new_age_df[new_age_df.columns]
save_checkpoint(incidents_df, 'checkpoint_4')
We visualize the data after the corrections:
incidents_df.sample(2, random_state=1)
| date | state | city_or_county | address | latitude | longitude | participant_age1 | participant_age_group1 | participant_gender1 | min_age_participants | avg_age_participants | max_age_participants | n_participants_child | n_participants_teen | n_participants_adult | n_males | n_females | n_killed | n_injured | n_arrested | n_unharmed | n_participants | notes | incident_characteristics1 | incident_characteristics2 | year | date_original | month | month_name | day | day_of_week | day_of_week_name | county | city | state_consistency | county_consistency | address_consistency | location_importance | address_type | state_house_district | state_senate_district | congressional_district | participant1_child | participant1_teen | participant1_adult | participant1_male | participant1_female | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 143491 | 2016-05-23 | FLORIDA | Port Charlotte | Conway Boulevard | 26.9829 | -82.0881 | 18.0 | Adult 18+ | Male | 18.0 | 19.0 | 20.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 | 3.0 | shot at bicyclist from blue vehicle | Shots Fired - No Injuries | Drive-by (car to street, car to car) | 2016.0 | 2016-05-23 | 5 | May | 23 | 0 | Monday | Charlotte County | Port Charlotte | 1 | 1 | 1 | 1.000000e-07 | place | 75.0 | 28.0 | 17.0 | False | False | True | True | False |
| 110344 | 2017-02-11 | PENNSYLVANIA | Harrisburg | 2400 block of Berryhill St | 40.2645 | -76.8475 | 24.0 | Adult 18+ | Male | 24.0 | 24.0 | 24.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN | Shot - Dead (murder, accidental, suicide) | Suicide^ | 2017.0 | 2017-02-11 | 2 | February | 11 | 5 | Saturday | Dauphin County | Harrisburg | 1 | 1 | 1 | 1.000000e-07 | place | NaN | NaN | 4.0 | False | False | True | True | False |
incidents_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 239381 entries, 0 to 239380 Data columns (total 47 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 216373 non-null datetime64[ns] 1 state 239381 non-null object 2 city_or_county 239381 non-null object 3 address 222934 non-null object 4 latitude 226585 non-null float64 5 longitude 226585 non-null float64 6 participant_age1 146259 non-null float64 7 participant_age_group1 197508 non-null object 8 participant_gender1 203266 non-null object 9 min_age_participants 144857 non-null float64 10 avg_age_participants 144855 non-null float64 11 max_age_participants 144856 non-null float64 12 n_participants_child 197948 non-null float64 13 n_participants_teen 197948 non-null float64 14 n_participants_adult 197948 non-null float64 15 n_males 198137 non-null float64 16 n_females 198137 non-null float64 17 n_killed 239381 non-null float64 18 n_injured 239381 non-null float64 19 n_arrested 211988 non-null float64 20 n_unharmed 211988 non-null float64 21 n_participants 214751 non-null float64 22 notes 158558 non-null object 23 incident_characteristics1 239055 non-null object 24 incident_characteristics2 141775 non-null object 25 year 216373 non-null float64 26 date_original 239381 non-null datetime64[ns] 27 month 239381 non-null int64 28 month_name 239381 non-null object 29 day 239381 non-null int64 30 day_of_week 239381 non-null int64 31 day_of_week_name 239381 non-null object 32 county 218467 non-null object 33 city 205836 non-null object 34 state_consistency 239381 non-null int64 35 county_consistency 239381 non-null int64 36 address_consistency 239381 non-null int64 37 location_importance 226585 non-null float64 38 address_type 226585 non-null object 39 state_house_district 201845 non-null float64 40 state_senate_district 208065 non-null float64 41 congressional_district 231699 non-null float64 42 participant1_child 196415 non-null object 43 participant1_teen 196415 non-null object 44 
participant1_adult 196415 non-null object 45 participant1_male 202178 non-null object 46 participant1_female 202178 non-null object dtypes: datetime64[ns](2), float64(21), int64(6), object(18) memory usage: 87.7+ MB
print('Number of rows in which all data are null: ', incidents_df.isnull().all(axis=1).sum())
print('Number of rows with some null data: ', incidents_df.isnull().any(axis=1).sum())
print('Number of rows in which number of participants is null: ', incidents_df[incidents_df['n_participants'].isnull()].shape[0])
print('Number of rows in which number of participants is 0: ', incidents_df[incidents_df['n_participants'] == 0].shape[0])
print('Number of rows in which number of participants is null and n_killed is not null: ', incidents_df[
incidents_df['n_participants'].isnull() & incidents_df['n_killed'].notnull()].shape[0])
print('Total rows with null value for n_participants: ', incidents_df['n_participants'].isnull().sum())
print('Total rows with null value for n_participants_child: ', incidents_df['n_participants_child'].isnull().sum())
print('Total rows with null value for n_participants_teen: ', incidents_df['n_participants_teen'].isnull().sum())
print('Total rows with null value for n_participants_adult: ', incidents_df['n_participants_adult'].isnull().sum())
print('Total rows with null value for n_males: ', incidents_df['n_males'].isnull().sum())
print('Total rows with null value for n_females: ', incidents_df['n_females'].isnull().sum())
Number of rows in which all data are null: 0 Number of rows with some null data: 196882 Number of rows in which number of participants is null: 24630 Number of rows in which number of participants is 0: 0 Number of rows in which number of participants is null and n_killed is not null: 24630 Total rows with null value for n_participants: 24630 Total rows with null value for n_participants_child: 41433 Total rows with null value for n_participants_teen: 41433 Total rows with null value for n_participants_adult: 41433 Total rows with null value for n_males: 41244 Total rows with null value for n_females: 41244
sns.heatmap(incidents_df.isnull(), cbar=False)
<Axes: >
We recovered all the data related to age and gender. In 98973 entries, at most one value is missing.
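The per-row missing-value count behind that figure can be computed with the following pattern (toy data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 6.0],
})

# number of missing values in each row
nulls_per_row = toy.isnull().sum(axis=1)
rows_with_at_most_one_nan = (nulls_per_row <= 1).sum()
```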
We now explore the distribution of the total number of participants and the number of participants per age group. Once again, we use a logarithmic scale on the y-axis:
def plot_hist(df_column, n_bin=100, density=True, title=None, y_label=None, color=None, y_logscale=False):
def iqr_fence(x):
q1 = x.quantile(0.25)
med = x.quantile(0.5)
q3 = x.quantile(0.75)
IQR = q3 - q1
u = x.max()
l = x.min()
Lower_Fence = builtins.max(builtins.min(x[x > q1 - (1.5 * IQR)], default=pd.Timestamp.min), l)
#Lower_Fence = builtins.max(q1 - (1.5 * IQR), l)
Upper_Fence = builtins.min(builtins.max(x[x < q3 + (1.5 * IQR)], default=pd.Timestamp.max), u)
#Upper_Fence = builtins.min(q3 + (1.5 * IQR), u)
return [Lower_Fence, q1, med, q3, Upper_Fence]
relevant_positions = iqr_fence(df_column)
n_items = len(df_column.index)
fig, axs = plt.subplots(3, sharex=True, figsize=(14, 6))
fig.suptitle(title)
# fixed bin
axs[0].hist(df_column, bins=n_bin, density=density, color=color)
axs[0].set_ylabel(str(n_bin) + ' bin')
axs[0].grid(axis='y')
if y_logscale:
axs[0].set_yscale('log')
# number of bins computed using Sturge's rule
n_bin = int(1 + math.log2(n_items))
axs[1].hist(df_column, bins=n_bin, density=density, color=color)
axs[1].set_ylabel("Sturge\'s rule binning")
if y_logscale:
axs[1].set_yscale('log')
axs[1].grid(axis='y')
axs[2].boxplot(x=df_column.dropna().values, labels=[''], vert=False)
axs[2].set_xlabel(y_label)
for i in range(2):
axs[i].axvline(x = relevant_positions[0], color = 'black', linestyle = '--', alpha=0.75)
axs[i].axvline(x = relevant_positions[1], color = 'black', linestyle = '-.', alpha=0.75)
axs[i].axvline(x = relevant_positions[2], color = 'black', linestyle = '-.', alpha=0.75)
axs[i].axvline(x = relevant_positions[3], color = 'black', linestyle = '-.', alpha=0.75)
axs[i].axvline(x = relevant_positions[4], color = 'black', linestyle = '--', alpha=0.75)
return fig
plot_hist(incidents_df['n_participants'], title='Distribution of number of participants', n_bin=104, y_label='n_participants', density=False, y_logscale=True);
plt.figure(figsize=(20, 5))
plt.hist(incidents_df['n_participants'], bins=104, edgecolor='black', linewidth=0.8)
plt.xlabel('Number of participants')
plt.ylabel('Frequency (log scale)')
plt.xticks(np.arange(1, 104, 2))
plt.yscale('log')
plt.title('Distribution of number of participants')
plt.show()
incidents_df[['n_participants', 'n_participants_child', 'n_participants_teen', 'n_participants_adult']].max().to_frame().rename(columns={0:'max value'})
| max value | |
|---|---|
| n_participants | 103.0 |
| n_participants_child | 7.0 |
| n_participants_teen | 27.0 |
| n_participants_adult | 103.0 |
incidents_df[incidents_df['n_participants_adult'] > 60][['n_participants', 'n_participants_adult',
'n_participants_child', 'n_participants_teen']]
| n_participants | n_participants_adult | n_participants_child | n_participants_teen | |
|---|---|---|---|---|
| 179866 | 63.0 | 63.0 | 0.0 | 0.0 |
| 235978 | 103.0 | 103.0 | 0.0 | 0.0 |
fig, (ax0, ax1, ax2) = plt.subplots(3, 1, figsize=(20, 12), sharex=True, sharey=True)
ax0.bar(incidents_df['n_participants_child'].value_counts().index, incidents_df['n_participants_child'].value_counts(),
alpha=0.8, color='magenta', edgecolor='black', linewidth=0.8, label='Children')
ax0.legend()
ax1.bar(incidents_df['n_participants_teen'].value_counts().index, incidents_df['n_participants_teen'].value_counts(),
alpha=0.8, color='red', edgecolor='black', linewidth=0.8, label='Teen')
ax1.legend()
ax2.bar(incidents_df['n_participants_adult'].value_counts().index, incidents_df['n_participants_adult'].value_counts(),
color='orange', edgecolor='black', linewidth=0.8, label='Adult')
ax2.legend()
plt.xlim(-1, 64)
plt.xticks(range(0, 64))
plt.yscale('log')
plt.xlabel('Number of participants')
ax0.set_ylabel('Number of incidents')
ax1.set_ylabel('Number of incidents')
ax2.set_ylabel('Number of incidents')
ax0.set_title('Number of participants for each incident per age')
plt.show()
We observe that incidents involve at most 7 children and at most 27 teenagers under the age of 18. In general, incidents involving a single person are much more frequent than other incidents, and most often they involve teenagers and children, with a smaller share involving adults. On the other hand, incidents with more than one participant mostly consist of adults, and as the number of participants increases, the frequency of such incidents decreases.
We also plot the distribution of the number of incidents per gender:
plt.figure(figsize=(20, 5))
plt.bar(incidents_df['n_males'].value_counts().index-0.2, incidents_df['n_males'].value_counts(), 0.4,
edgecolor='black', linewidth=0.8, label='Male participants')
plt.bar(incidents_df['n_females'].value_counts().index+0.2, incidents_df['n_females'].value_counts(), 0.4,
edgecolor='black', linewidth=0.8, label='Female participants')
plt.xticks(range(0, 64))
plt.yscale('log')
plt.xlabel('Number of participants')
plt.ylabel('Number of incidents')
plt.legend()
plt.title('Number of participants for each incident per gender')
plt.show()
Below, we plot the distribution of the average age of participants in each incident.
plot_hist(incidents_df['avg_age_participants'], y_label='avg_age_participants', density=False);
We use a word cloud to display the most frequent words in the attribute notes:
nltk.download('stopwords')
stopwords_set = set(stopwords.words('english'))
word_cloud_all_train = WordCloud(
width=800,
height=400,
stopwords=stopwords_set,
collocations=False,
background_color='white'
).generate(' '.join(incidents_df[incidents_df['notes'].notna()]['notes'].tolist()));
word_cloud_all_train.to_svg()
plt.figure( figsize=(20,10) )
plt.imshow(word_cloud_all_train)
plt.axis('off');
plt.title('Word cloud of notes\n', fontsize=40, fontweight='bold');
plt.savefig("../html/word_cloud_notes.svg")
[nltk_data] Downloading package stopwords to [nltk_data] /Users/irenetesta/nltk_data... [nltk_data] Package stopwords is already up-to-date!
We check that, for each record, the second characteristic (when present) differs from the first:
incidents_df[incidents_df['incident_characteristics1']==incidents_df['incident_characteristics2']].shape[0]==0
True
We plot the frequency of each characteristic:
# merge characteristics list
ch1_counts = incidents_df['incident_characteristics1'].value_counts()
ch2_counts = incidents_df['incident_characteristics2'].value_counts()
ch_counts = ch1_counts.add(ch2_counts, fill_value=0).sort_values(ascending=True)
ch_counts.to_frame()
| 0 | |
|---|---|
| Police Targeted | 1.0 |
| Defensive Use - Crime occurs, victim shoots subject/suspect/perpetrator | 1.0 |
| Cleaning gun | 1.0 |
| Child with gun - no shots fired | 1.0 |
| Officer Involved Shooting - subject/suspect/perpetrator unarmed | 1.0 |
| ... | ... |
| Officer Involved Incident | 14571.0 |
| Shots Fired - No Injuries | 35700.0 |
| Non-Shooting Incident | 44665.0 |
| Shot - Dead (murder, accidental, suicide) | 53396.0 |
| Shot - Wounded/Injured | 93910.0 |
91 rows × 1 columns
fig = ch_counts.plot(kind='barh', figsize=(5, 18))
fig.set_xscale("log")
plt.title("Counts of 'incident_characteristics'")
plt.xlabel('Count')
plt.ylabel('Incident characteristics')
plt.tight_layout()
ch1_females_counts = incidents_df[incidents_df['n_females']>1]['incident_characteristics1'].value_counts()
ch2_females_counts = incidents_df[incidents_df['n_females']>1]['incident_characteristics2'].value_counts()
ch_females_counts = ch1_females_counts.add(ch2_females_counts, fill_value=0).sort_values(ascending=False).plot(
kind='bar',
title='Characteristics counts of incidents with females involved',
figsize=(20,10)
)
characteristics_count_matrix = pd.crosstab(incidents_df['incident_characteristics2'], incidents_df['incident_characteristics1'])
fig, ax = plt.subplots(figsize=(25, 20))
sns.heatmap(characteristics_count_matrix, cmap='coolwarm', ax=ax, xticklabels=True, yticklabels=True, linewidths=.5)
ax.set_xlabel('incident_characteristics1')
ax.set_ylabel('incident_characteristics2')
ax.set_title('Counts of incident characteristics')
plt.tight_layout()
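`pd.crosstab` builds the co-occurrence matrix visualized above: one row per value of the first argument, one column per value of the second, with cell counts of how often each pair occurs together. A minimal sketch on toy data (hypothetical values):

```python
import pandas as pd

# Hypothetical pairs of first/second characteristics
c1 = pd.Series(['A', 'A', 'B'], name='ch1')
c2 = pd.Series(['X', 'Y', 'X'], name='ch2')

# Rows are values of c2, columns are values of c1; cells count the pairs
m = pd.crosstab(c2, c1)
```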
characteristics_count_matrix[["Shot - Dead (murder, accidental, suicide)"]].sort_values(
by="Shot - Dead (murder, accidental, suicide)",
inplace=False,
ascending=False).plot(
kind='bar',
figsize=(20,10)
);
We defined the following binary tags to categorize the characteristics of each incident:
Each tag was set to True if and only if we had enough information to assume that the incident had that particular characteristic.
We set all the tags and check their consistency w.r.t. the other data:
from TASK_1.data_preparation_utils import add_tags, check_tag_consistency, check_characteristics_consistency, IncidentTag
tags_columns = [tag.name for tag in IncidentTag]
tags_columns.append('tag_consistency')
if LOAD_DATA_FROM_CHECKPOINT:
with zipfile.ZipFile('checkpoints/checkpoint_5.csv.zip', 'r') as zip_ref:
zip_ref.extractall('checkpoints/')
incidents_df = load_checkpoint('checkpoint_5', date_cols=['date', 'date_original'])
else:
incidents_df = add_tags(incidents_df)
incidents_df['tag_consistency'] = True
incidents_df = incidents_df.apply(lambda row: check_tag_consistency(row), axis=1)
incidents_df = incidents_df.apply(lambda row: check_characteristics_consistency(row), axis=1)
save_checkpoint(incidents_df, 'checkpoint_5')
incidents_df['tag_consistency'].value_counts().to_frame()
| tag_consistency | |
|---|---|
| True | 221461 |
| False | 17920 |
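The consistency check applies a per-row function with `DataFrame.apply(axis=1)`. The actual logic lives in `TASK_1.data_preparation_utils` and is not shown here; the following is only a hypothetical sketch of the kind of check performed, flagging a record when a tag contradicts the numerical data (e.g. `death` is False while `n_killed` > 0):

```python
import pandas as pd

def check_row(row):
    # Hypothetical consistency rules: a tag must agree with the counts
    consistent = True
    if row['n_killed'] > 0 and not row['death']:
        consistent = False
    if row['n_injured'] > 0 and not row['injuries']:
        consistent = False
    row['tag_consistency'] = consistent
    return row

# Toy records: the first contradicts its tags, the second does not
df = pd.DataFrame({
    'n_killed': [1, 0],
    'n_injured': [0, 2],
    'death': [False, False],
    'injuries': [False, True],
})
df = df.apply(check_row, axis=1)
```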
We correct the inconsistencies in the tags, assuming the numerical data are correct, and save the dataset again.
from TASK_1.data_preparation_utils import set_tags_consistent_data
if LOAD_DATA_FROM_CHECKPOINT:
with zipfile.ZipFile('checkpoints/checkpoint_6.csv.zip', 'r') as zip_ref:
zip_ref.extractall('checkpoints/')
incidents_df = load_checkpoint('checkpoint_6', date_cols=['date', 'date_original'])
else:
for index, record in incidents_df.iterrows():
if record['tag_consistency'] == False:
incidents_df.loc[index] = set_tags_consistent_data(record)
incidents_df.loc[index, 'tag_consistency'] = True # set to true and then check again if it's still not consistent
incidents_df = incidents_df.apply(lambda row: check_tag_consistency(row), axis=1)
incidents_df = incidents_df.apply(lambda row: check_characteristics_consistency(row), axis=1)
save_checkpoint(incidents_df, 'checkpoint_6')
pd.DataFrame(data=incidents_df.dtypes).T
| date | state | city_or_county | address | latitude | longitude | participant_age1 | participant_age_group1 | participant_gender1 | min_age_participants | avg_age_participants | max_age_participants | n_participants_child | n_participants_teen | n_participants_adult | n_males | n_females | n_killed | n_injured | n_arrested | n_unharmed | n_participants | notes | incident_characteristics1 | incident_characteristics2 | year | date_original | month | month_name | day | day_of_week | day_of_week_name | county | city | state_consistency | county_consistency | address_consistency | location_importance | address_type | state_house_district | state_senate_district | congressional_district | participant1_child | participant1_teen | participant1_adult | participant1_male | participant1_female | firearm | air_gun | shots | aggression | suicide | injuries | death | road | illegal_holding | house | school | children | drugs | officers | organized | social_reasons | defensive | workplace | abduction | unintentional | tag_consistency | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | datetime64[ns] | object | object | object | float64 | float64 | float64 | object | object | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | object | object | object | float64 | datetime64[ns] | int64 | object | int64 | int64 | object | object | object | int64 | int64 | int64 | float64 | object | float64 | float64 | float64 | object | object | object | object | object | bool | bool | bool | bool | bool | bool | bool | bool | bool | bool | bool | bool | bool | bool | bool | bool | bool | bool | bool | bool | bool |
incidents_df['tag_consistency'].value_counts().to_frame()
| tag_consistency | |
|---|---|
| True | 221461 |
| False | 17920 |
We display the frequencies of the tags grouping them:
tags_counts = {}
tags_counts['Murder'] = incidents_df[
(incidents_df['death']==True) &
((incidents_df['aggression']==True) |
(incidents_df['social_reasons']==True))].shape[0] # not accidental nor defensive
tags_counts['Suicide'] = incidents_df[
(incidents_df['death']==True) &
    (incidents_df['suicide']==True)].shape[0] # warning: murder/suicide incidents are counted twice
tags_counts['Defensive'] = incidents_df[
(incidents_df['death']==True) &
(incidents_df['defensive']==True)].shape[0]
tags_counts['Accidental'] = incidents_df[
(incidents_df['death']==True) &
(incidents_df['unintentional']==True)].shape[0]
tags_counts['Others or not known'] = incidents_df[
(incidents_df['death']==True) &
(incidents_df['aggression']==False) &
(incidents_df['social_reasons']==False) &
(incidents_df['unintentional']==False)].shape[0]
fig, ax = plt.subplots()
total = sum(tags_counts.values())
ax.pie(tags_counts.values())
legend_labels = [f'{label}: {(size/total)*100:.1f}%' for label, size in tags_counts.items()]
plt.legend(legend_labels)
plt.title("Gun incidents")
plt.show()
fig.savefig("../html/pie_incident_type.svg")
Most of the incidents involve murder; suicides, defensive uses and accidental shootings are very few compared to murders. The other big slice of the pie belongs to 'Others or not known', showing that there are many different, less common kinds of incidents.
We show the frequency of the values of each singular tag:
ax = (incidents_df[tags_columns].apply(lambda col: col.value_counts()).T.sort_values(by=True)/incidents_df.shape[0]*100).plot(kind='barh', stacked=True, alpha=0.8, edgecolor='black')
for container in ax.containers:
ax.bar_label(container, fmt='%.1f%%', label_type='center', fontsize=8)
plt.title("Incidents characteristic (%)")
Text(0.5, 1.0, 'Incidents characteristic (%)')
The most common tags are firearm, shots, aggression and injuries (above 50% of the records); in particular, firearm is True for almost every record (97.8%). On the other hand, some tags (air_gun, school, social_reasons and abduction) are very rare.
We check for correlations between accidental incidents and the presence of children.
incidents_df['unintentional'].corr(incidents_df['n_participants_child']>0)
0.15398486295938882
The correlation between the two events is weak (r ≈ 0.15).
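`Series.corr` on boolean data computes a Pearson correlation treating True/False as 1/0 (the phi coefficient), which is the quantity computed above. A toy check on made-up values:

```python
import pandas as pd

a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])  # independent of a

r_indep = a.corr(b)  # no association: correlation close to 0
r_self = a.corr(a)   # perfect association: correlation 1
```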
We display the most common characteristics for incidents involving women.
ch1_females_counts = incidents_df[incidents_df['n_females']>1]['incident_characteristics1'].value_counts()
ch2_females_counts = incidents_df[incidents_df['n_females']>1]['incident_characteristics2'].value_counts()
ch_females_counts = ch1_females_counts.add(ch2_females_counts, fill_value=0).sort_values(ascending=False).plot(
kind='bar',
title='Characteristics counts of incidents with females involved',
figsize=(20,10)
)
The distribution is very similar to the one for incidents involving both men and women. Among the main differences: for women, the frequency of suicides is higher, while the involvement of officers is lower.
We plot on a map the location of mass shootings, incidents involving children and suicides:
plot_scattermap_plotly(
incidents_df[incidents_df['n_killed']>=4],
zoom=2,
title='Mass shootings'
)
plot_scattermap_plotly(
incidents_df[incidents_df['children']==True],
zoom=2,
title='Incidents involving children'
)
plot_scattermap_plotly(
incidents_df[incidents_df['suicide']==True],
zoom=2,
title='Suicides'
)
The geographic distribution of these kinds of incidents is similar to that of the whole dataset.
We join the poverty data with the incidents data:
poverty_df['state'] = poverty_df['state'].str.upper()
incidents_df = incidents_df.merge(poverty_df, on=['state', 'year'], how='left', validate="m:1")
incidents_df.head()
| date | state | city_or_county | address | latitude | longitude | participant_age1 | participant_age_group1 | participant_gender1 | min_age_participants | avg_age_participants | max_age_participants | n_participants_child | n_participants_teen | n_participants_adult | n_males | n_females | n_killed | n_injured | n_arrested | n_unharmed | n_participants | notes | incident_characteristics1 | incident_characteristics2 | year | date_original | month | month_name | day | day_of_week | day_of_week_name | county | city | state_consistency | county_consistency | address_consistency | location_importance | address_type | state_house_district | state_senate_district | congressional_district | participant1_child | participant1_teen | participant1_adult | participant1_male | participant1_female | firearm | air_gun | shots | aggression | suicide | injuries | death | road | illegal_holding | house | school | children | drugs | officers | organized | social_reasons | defensive | workplace | abduction | unintentional | tag_consistency | povertyPercentage | px_code | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-05-02 | INDIANA | Indianapolis | Lafayette Road and Pike Plaza | 39.8322 | -86.2492 | 19.0 | Adult 18+ | Male | 19.0 | 19.0 | 19.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | Teen wounded while walking - Security guard at nearby carnival aided victim. | Shot - Wounded/Injured | NaN | 2015.0 | 2015-05-02 | 5 | May | 2 | 5 | Saturday | Marion County | Indianapolis | 1 | 1 | 1 | 1.000100e-01 | road | 94.0 | 33.0 | 7.0 | False | False | True | True | False | True | False | True | True | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | 12.3 | IN |
| 1 | 2017-04-03 | PENNSYLVANIA | Kane | 5647 US 6 | 41.6645 | -78.7856 | 62.0 | Adult 18+ | Male | 62.0 | 62.0 | 62.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | shot self after accident | Shot - Dead (murder, accidental, suicide) | Suicide^ | 2017.0 | 2017-04-03 | 4 | April | 3 | 0 | Monday | McKean County | Wetmore Township | 1 | -1 | 1 | 1.000100e-01 | road | NaN | NaN | 5.0 | False | False | True | True | False | True | False | True | False | True | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | True | 10.5 | PA |
| 2 | 2016-11-05 | MICHIGAN | Detroit | 6200 Block of East McNichols Road | 42.4190 | -83.0393 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 1.0 | 0.0 | 1.0 | 2.0 | 1 inj. | Shot - Wounded/Injured | NaN | 2016.0 | 2016-11-05 | 11 | November | 5 | 5 | Saturday | Wayne County | Detroit | 1 | 1 | 1 | 1.000000e-07 | place | 4.0 | 2.0 | 14.0 | NaN | NaN | NaN | NaN | NaN | True | False | True | True | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | 11.0 | MI |
| 3 | 2016-10-15 | DISTRICT OF COLUMBIA | Washington | 1000 block of Bladensburg Road, NE | 38.9030 | -76.9820 | NaN | Adult 18+ | Male | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | NaN | Shot - Wounded/Injured | NaN | 2016.0 | 2016-10-15 | 10 | October | 15 | 5 | Saturday | Trinidad | Washington | 1 | 1 | 1 | 1.000000e-05 | building | NaN | NaN | 0.0 | False | False | True | True | False | True | False | True | True | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | 14.9 | DC |
| 4 | NaT | PENNSYLVANIA | Pittsburgh | California and Marshall Avenues | 40.4621 | -80.0308 | NaN | Adult 18+ | Male | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 2.0 | NaN | Shot - Wounded/Injured | Drive-by (car to street, car to car) | NaN | 2030-06-14 | 6 | June | 14 | 4 | Friday | Allegheny County | Pittsburgh | 1 | 1 | 1 | 1.000100e-01 | road | NaN | NaN | 14.0 | False | False | True | True | False | True | False | True | True | False | True | False | True | False | False | False | False | False | False | False | False | False | False | False | False | True | NaN | NaN |
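The join above uses `validate="m:1"`, which makes `merge` raise a `MergeError` if the right table contains duplicate `(state, year)` keys, guarding against accidental row duplication. A minimal sketch on toy data (hypothetical values):

```python
import pandas as pd

# Many incidents per (state, year) on the left, one poverty row on the right
left = pd.DataFrame({'state': ['INDIANA', 'INDIANA'], 'year': [2015, 2016]})
right = pd.DataFrame({'state': ['INDIANA'], 'year': [2015],
                      'povertyPercentage': [12.3]})

# validate='m:1' asserts the right-side keys are unique;
# how='left' keeps left rows without a match (NaN poverty)
joined = left.merge(right, on=['state', 'year'], how='left', validate='m:1')
```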
We join the elections data with the incidents data:
elections_df_copy = elections_df.copy()
elections_df_copy['year'] = elections_df_copy['year'] + 1
elections_df = pd.concat([elections_df, elections_df_copy], ignore_index=True)
incidents_df = incidents_df.merge(elections_df, on=['state', 'year', 'congressional_district'], how='left')
incidents_df.head()
| date | state | city_or_county | address | latitude | longitude | participant_age1 | participant_age_group1 | participant_gender1 | min_age_participants | avg_age_participants | max_age_participants | n_participants_child | n_participants_teen | n_participants_adult | n_males | n_females | n_killed | n_injured | n_arrested | n_unharmed | n_participants | notes | incident_characteristics1 | incident_characteristics2 | year | date_original | month | month_name | day | day_of_week | day_of_week_name | county | city | state_consistency | county_consistency | address_consistency | location_importance | address_type | state_house_district | state_senate_district | congressional_district | participant1_child | participant1_teen | participant1_adult | participant1_male | participant1_female | firearm | air_gun | shots | aggression | suicide | injuries | death | road | illegal_holding | house | school | children | drugs | officers | organized | social_reasons | defensive | workplace | abduction | unintentional | tag_consistency | povertyPercentage | px_code | party | candidatevotes | totalvotes | candidateperc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-05-02 | INDIANA | Indianapolis | Lafayette Road and Pike Plaza | 39.8322 | -86.2492 | 19.0 | Adult 18+ | Male | 19.0 | 19.0 | 19.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | Teen wounded while walking - Security guard at nearby carnival aided victim. | Shot - Wounded/Injured | NaN | 2015.0 | 2015-05-02 | 5 | May | 2 | 5 | Saturday | Marion County | Indianapolis | 1 | 1 | 1 | 1.000100e-01 | road | 94.0 | 33.0 | 7.0 | False | False | True | True | False | True | False | True | True | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | 12.3 | IN | DEMOCRAT | 61443.0 | 112261.0 | 54.732276 |
| 1 | 2017-04-03 | PENNSYLVANIA | Kane | 5647 US 6 | 41.6645 | -78.7856 | 62.0 | Adult 18+ | Male | 62.0 | 62.0 | 62.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | shot self after accident | Shot - Dead (murder, accidental, suicide) | Suicide^ | 2017.0 | 2017-04-03 | 4 | April | 3 | 0 | Monday | McKean County | Wetmore Township | 1 | -1 | 1 | 1.000100e-01 | road | NaN | NaN | 5.0 | False | False | True | True | False | True | False | True | False | True | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | True | 10.5 | PA | REPUBLICAN | 206761.0 | 307843.0 | 67.164431 |
| 2 | 2016-11-05 | MICHIGAN | Detroit | 6200 Block of East McNichols Road | 42.4190 | -83.0393 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 1.0 | 0.0 | 1.0 | 2.0 | 1 inj. | Shot - Wounded/Injured | NaN | 2016.0 | 2016-11-05 | 11 | November | 5 | 5 | Saturday | Wayne County | Detroit | 1 | 1 | 1 | 1.000000e-07 | place | 4.0 | 2.0 | 14.0 | NaN | NaN | NaN | NaN | NaN | True | False | True | True | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | 11.0 | MI | DEMOCRAT | 244135.0 | 310974.0 | 78.506563 |
| 3 | 2016-10-15 | DISTRICT OF COLUMBIA | Washington | 1000 block of Bladensburg Road, NE | 38.9030 | -76.9820 | NaN | Adult 18+ | Male | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | NaN | Shot - Wounded/Injured | NaN | 2016.0 | 2016-10-15 | 10 | October | 15 | 5 | Saturday | Trinidad | Washington | 1 | 1 | 1 | 1.000000e-05 | building | NaN | NaN | 0.0 | False | False | True | True | False | True | False | True | True | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | 14.9 | DC | DEMOCRAT | 265178.0 | 300906.0 | 88.126525 |
| 4 | NaT | PENNSYLVANIA | Pittsburgh | California and Marshall Avenues | 40.4621 | -80.0308 | NaN | Adult 18+ | Male | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 2.0 | NaN | Shot - Wounded/Injured | Drive-by (car to street, car to car) | NaN | 2030-06-14 | 6 | June | 14 | 4 | Friday | Allegheny County | Pittsburgh | 1 | 1 | 1 | 1.000100e-01 | road | NaN | NaN | 14.0 | False | False | True | True | False | True | False | True | True | False | True | False | True | False | False | False | False | False | False | False | False | False | False | False | False | True | NaN | NaN | NaN | NaN | NaN | NaN |
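The trick used above duplicates each election row with `year + 1`, so that incidents in odd (non-election) years join with the most recent election results. A minimal sketch on toy data (hypothetical values):

```python
import pandas as pd

# One hypothetical election record for an even year
elections = pd.DataFrame({'year': [2016], 'state': ['OHIO'], 'party': ['DEMOCRAT']})

# Duplicate it with year+1 so 2017 incidents also find a match
shifted = elections.copy()
shifted['year'] = shifted['year'] + 1
both = pd.concat([elections, shifted], ignore_index=True)
```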
We read and join the USA population data from the 2010 census (downloaded from Wikipedia).
usa_population_df = pd.read_csv(DATA_FOLDER_PATH + 'external_data/2010_United_States_census.csv')
usa_population_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 52 entries, 0 to 51 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Rank 51 non-null object 1 State 52 non-null object 2 Population as of 2010 census 52 non-null object 3 Population as of 2000 census 52 non-null object 4 Change 52 non-null object 5 Percent change 52 non-null object dtypes: object(6) memory usage: 2.6+ KB
usa_population_df.drop(columns=['Population as of 2000 census', 'Change', 'Percent change', 'Rank'], inplace=True)
usa_population_df.rename(columns={'Population as of 2010 census':'population_state_2010', 'State': 'state'}, inplace=True)
usa_population_df['state'] = usa_population_df['state'].str.upper()
usa_population_df['population_state_2010'] = usa_population_df['population_state_2010'].str.replace(',', '').astype('int64')
incidents_df = incidents_df.merge(usa_population_df, on=['state'], how='left')
incidents_df.sample(5, random_state=1)
| date | state | city_or_county | address | latitude | longitude | participant_age1 | participant_age_group1 | participant_gender1 | min_age_participants | avg_age_participants | max_age_participants | n_participants_child | n_participants_teen | n_participants_adult | n_males | n_females | n_killed | n_injured | n_arrested | n_unharmed | n_participants | notes | incident_characteristics1 | incident_characteristics2 | year | date_original | month | month_name | day | day_of_week | day_of_week_name | county | city | state_consistency | county_consistency | address_consistency | location_importance | address_type | state_house_district | state_senate_district | congressional_district | participant1_child | participant1_teen | participant1_adult | participant1_male | participant1_female | firearm | air_gun | shots | aggression | suicide | injuries | death | road | illegal_holding | house | school | children | drugs | officers | organized | social_reasons | defensive | workplace | abduction | unintentional | tag_consistency | povertyPercentage | px_code | party | candidatevotes | totalvotes | candidateperc | population_state_2010 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 143491 | 2016-05-23 | FLORIDA | Port Charlotte | Conway Boulevard | 26.9829 | -82.0881 | 18.0 | Adult 18+ | Male | 18.0 | 19.0 | 20.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 | 3.0 | shot at bicyclist from blue vehicle | Shots Fired - No Injuries | Drive-by (car to street, car to car) | 2016.0 | 2016-05-23 | 5 | May | 23 | 0 | Monday | Charlotte County | Port Charlotte | 1 | 1 | 1 | 1.000000e-07 | place | 75.0 | 28.0 | 17.0 | False | False | True | True | False | True | False | True | True | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | True | 13.6 | FL | REPUBLICAN | 209348.0 | 338675.0 | 61.813833 | 18801310 |
| 110344 | 2017-02-11 | PENNSYLVANIA | Harrisburg | 2400 block of Berryhill St | 40.2645 | -76.8475 | 24.0 | Adult 18+ | Male | 24.0 | 24.0 | 24.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN | Shot - Dead (murder, accidental, suicide) | Suicide^ | 2017.0 | 2017-02-11 | 2 | February | 11 | 5 | Saturday | Dauphin County | Harrisburg | 1 | 1 | 1 | 1.000000e-07 | place | NaN | NaN | 4.0 | False | False | True | True | False | True | False | True | False | True | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | True | 10.5 | PA | REPUBLICAN | 220628.0 | 334000.0 | 66.056287 | 12702379 |
| 62395 | 2016-07-30 | FLORIDA | Jacksonville Beach | 2nd Street and 3rd Avenue | 30.2855 | -81.3900 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | across from Ritz Lounge, casings found | Institution/Group/Business | Shots Fired - No Injuries | 2016.0 | 2016-07-30 | 7 | July | 30 | 5 | Saturday | Duval County | Jacksonville Beach | 1 | 1 | 1 | 1.000000e-07 | place | 11.0 | 4.0 | 4.0 | NaN | NaN | NaN | NaN | NaN | True | False | True | True | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | True | 13.6 | FL | REPUBLICAN | 287509.0 | 409662.0 | 70.182004 | 18801310 |
| 159518 | 2015-12-19 | NEBRASKA | Omaha | 12441 Miami St. | 41.2834 | -96.1073 | 24.0 | Adult 18+ | Male | 24.0 | 24.0 | 24.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN | Shot - Dead (murder, accidental, suicide) | Accidental Shooting | 2015.0 | 2015-12-19 | 12 | December | 19 | 5 | Saturday | Douglas County | Omaha | 1 | 1 | 1 | 1.000000e-07 | place | NaN | 6.0 | 2.0 | False | False | True | True | False | True | False | True | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | True | True | 10.1 | NE | DEMOCRAT | 83872.0 | 171509.0 | 48.902390 | 1826341 |
| 26161 | NaT | NORTH CAROLINA | Salisbury | 1505 E. Innes Street | 35.6533 | -80.4572 | NaN | Adult 18+ | Male | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | Walgreens south of Interstate 85 | Armed robbery with injury/death and/or evidence of DGU found | Institution/Group/Business | NaN | 2029-06-04 | 6 | June | 4 | 0 | Monday | Rowan County | Salisbury | 1 | 1 | 1 | 1.000000e-05 | shop | 76.0 | 34.0 | 12.0 | False | False | True | True | False | True | False | False | True | False | False | False | False | True | False | False | False | False | False | False | False | False | True | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN | 9535483 |
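The population column comes in as comma-grouped strings (e.g. "37,253,956"), hence the `str.replace` followed by `astype('int64')` before merging; state names are upper-cased to match the incidents data. A minimal sketch on a made-up row:

```python
import pandas as pd

pop = pd.DataFrame({'state': ['California'],
                    'population_state_2010': ['37,253,956']})

# Normalize the join key and parse the comma-grouped number
pop['state'] = pop['state'].str.upper()
pop['population_state_2010'] = (pop['population_state_2010']
                                .str.replace(',', '')
                                .astype('int64'))
```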
We plot the number of incidents per state per 100k inhabitants:
incidents_per_state = incidents_df[incidents_df['year']<=2020].groupby(['state', 'population_state_2010']).size()
incidents_per_state = ((incidents_per_state / incidents_per_state.index.get_level_values('population_state_2010'))*100000).to_frame(name='incidents_per_100k_inhabitants').sort_values(by='incidents_per_100k_inhabitants', ascending=True)
incidents_per_state.reset_index(inplace=True)
incidents_per_state.plot(
kind='barh',
x='state',
y='incidents_per_100k_inhabitants',
figsize=(15, 10),
ylabel='State',
xlabel='Incidents per 100k inhabitants',
title='Incidents per 100k inhabitants per state'
)
<Axes: title={'center': 'Incidents per 100k inhabitants per state'}, xlabel='Incidents per 100k inhabitants', ylabel='State'>
District of Columbia has the highest number of incidents per 100k inhabitants.
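The per-100k rate is computed by grouping on both the state and its population, so the population is available as an index level to divide by. A minimal sketch on toy data (hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({'state': ['A', 'A', 'B'],
                   'population_state_2010': [200000, 200000, 50000]})

# size() gives incidents per (state, population); the population index
# level is then used to normalize the counts to a per-100k rate
counts = df.groupby(['state', 'population_state_2010']).size()
per_100k = (counts / counts.index.get_level_values('population_state_2010')) * 100000
```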
We display the frequency of the characteristics for the District of Columbia:
incidents_df[incidents_df['state']=='DISTRICT OF COLUMBIA']['incident_characteristics1'].value_counts().plot(kind='barh', figsize=(20, 10))
<Axes: >
incidents_df[(incidents_df['state']=='DISTRICT OF COLUMBIA')&(incidents_df['year']<2014)].shape
(5, 75)
We visualize the number of incidents happened in each state every month:
incidents_df['year'] = incidents_df['year'].astype('UInt64')
incidents_per_month_per_state = incidents_df.groupby(['state', 'month', 'year']).size()
incidents_per_month_per_state = incidents_per_month_per_state.to_frame(name='incidents').reset_index()
incidents_per_month_per_state = incidents_per_month_per_state.sort_values(by=['year', 'month', 'state'], ignore_index=True)
incidents_per_month_per_state['incidents_per_100k_inhabitants'] = incidents_per_month_per_state.apply(
lambda row: (row['incidents'] / usa_population_df[usa_population_df['state']==row['state']]['population_state_2010'].iloc[0])*100000,
axis=1
)
fig, ax = plt.subplots(figsize=(20, 10))
sns.heatmap(
incidents_per_month_per_state[incidents_per_month_per_state.year<=2020].pivot(
index='state',
columns=['year', 'month'],
values='incidents_per_100k_inhabitants'
).fillna(0),
cmap='coolwarm',
ax=ax,
xticklabels=True,
yticklabels=True,
linewidths=.5
)
ax.set_xlabel('Year-Month')
ax.set_ylabel('State')
ax.set_title('Number of incidents per 100k inhabitants')
plt.xticks(rotation=90)
plt.tight_layout()
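The heatmap input is built with `DataFrame.pivot`, passing a list of columns to obtain a `(year, month)` MultiIndex on the column axis; `fillna(0)` replaces missing state-month combinations. A minimal sketch on toy data (hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({'state': ['A', 'A', 'B'],
                   'year': [2015, 2015, 2015],
                   'month': [1, 2, 1],
                   'incidents_per_100k_inhabitants': [1.0, 2.0, 3.0]})

# One row per state, one column per (year, month); missing cells become 0
wide = df.pivot(index='state',
                columns=['year', 'month'],
                values='incidents_per_100k_inhabitants').fillna(0)
```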
incidents_per_month_per_state = incidents_df[incidents_df['incident_characteristics1']!='Non-Shooting Incident'].groupby(['state', 'month', 'year']).size()
incidents_per_month_per_state = incidents_per_month_per_state.to_frame(name='incidents').reset_index()
incidents_per_month_per_state = incidents_per_month_per_state.sort_values(by=['year', 'month', 'state'], ignore_index=True)
incidents_per_month_per_state['incidents_per_100k_inhabitants'] = incidents_per_month_per_state.apply(
lambda row: (row['incidents'] / usa_population_df[usa_population_df['state']==row['state']]['population_state_2010'].iloc[0])*100000,
axis=1
)
fig, ax = plt.subplots(figsize=(20, 10))
sns.heatmap(
incidents_per_month_per_state[incidents_per_month_per_state.year<=2020].pivot(
index='state',
columns=['year', 'month'],
values='incidents_per_100k_inhabitants'
).fillna(0),
cmap='coolwarm',
ax=ax,
xticklabels=True,
yticklabels=True,
linewidths=.5
)
ax.set_xlabel('Year-Month')
ax.set_ylabel('State')
ax.set_title('Number of incidents per 100k inhabitants (excluding non-shooting incidents)')
plt.xticks(rotation=90)
plt.tight_layout()
We exclude District of Columbia:
fig, ax = plt.subplots(figsize=(20, 10))
sns.heatmap(
incidents_per_month_per_state[(incidents_per_month_per_state.year<=2020) & (incidents_per_month_per_state['state']!='DISTRICT OF COLUMBIA')].pivot(
index='state',
columns=['year', 'month'],
values='incidents_per_100k_inhabitants'
).fillna(0),
cmap='coolwarm',
ax=ax,
xticklabels=True,
yticklabels=True,
linewidths=.5
)
ax.set_xlabel('Year-Month')
ax.set_ylabel('State')
ax.set_title('Number of incidents per 100k inhabitants')
plt.xticks(rotation=90)
plt.tight_layout()
fig.savefig("../html/heatmap_incidents_months.svg")
We plot the correlations between the number of incidents and the poverty percentage in each state coloring the points according to the party that got the majority of votes in that state:
winning_party_per_state = pd.read_csv(DATA_FOLDER_PATH + 'winning_party_per_state.csv')
winning_party_per_state_copy = winning_party_per_state.copy()
winning_party_per_state_copy['year'] = winning_party_per_state['year'] + 1
winning_party_per_state = pd.concat([winning_party_per_state, winning_party_per_state_copy], ignore_index=True)
incidents_df = incidents_df[incidents_df['year'].notna()].merge(winning_party_per_state[['state', 'year', 'majority_state_party']], on=['state', 'year'], how='left')
incidents_per_state_2016 = incidents_df[(incidents_df['n_killed']>0)].groupby(['state', 'year', 'population_state_2010', 'povertyPercentage', 'majority_state_party']).size()
incidents_per_state_2016 = incidents_per_state_2016.to_frame(name='incidents').reset_index()
incidents_per_state_2016['incidents_per_100k_inhabitants'] = (incidents_per_state_2016['incidents'] / incidents_per_state_2016['population_state_2010'])*100000
fig = px.scatter(
incidents_per_state_2016,
x='povertyPercentage',
y='incidents_per_100k_inhabitants',
color='majority_state_party',
hover_name='state',
hover_data={'povertyPercentage': True, 'incidents_per_100k_inhabitants': True},
title='Mortal gun incidents in the USA',
facet_col="year",
facet_col_wrap=3,
height=800
)
pyo.plot(fig, filename='../html/scatter_poverty.html', auto_open=False)
fig.show()
The two attributes are slightly correlated.
We plot the correlations between each attribute:
numerical_columns = incidents_df.select_dtypes(include=['float64', 'int64']).columns
fig = plt.figure(figsize=(15, 12))
corr_matrix = incidents_df[numerical_columns].corr(method='pearson')
sns.heatmap(corr_matrix, mask=np.triu(corr_matrix));
plt.title('Correlation matrix (Pearson)');
fig.savefig("../html/attributes_correlation_matrix.svg")
The attributes min_age_participants, avg_age_participants and max_age_participants are positively correlated with each other (probably because many incidents involve a single person). n_participants is positively correlated with n_participants_adult and n_males: most incidents involve adult males. povertyPercentage is inversely correlated with latitude: southern states are poorer.
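The plot above masks the redundant upper triangle of the symmetric correlation matrix by passing `np.triu(corr_matrix)` as the `mask` argument of `sns.heatmap` (non-zero mask entries are hidden). A minimal sketch of the masking step on toy data (hypothetical values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                   'b': [2.0, 4.0, 6.0],   # b = 2*a: perfectly correlated
                   'c': [3.0, 1.0, 2.0]})
corr = df.corr(method='pearson')

# np.triu zeroes the strictly lower triangle; the remaining non-zero
# entries (upper triangle and diagonal) are the cells seaborn hides
mask = np.triu(corr)
```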
We re-order the columns and we save the cleaned dataset:
time_columns = ['date', 'date_original', 'year', 'month', 'day', 'day_of_week']
geo_columns = ['state', 'address', 'latitude', 'longitude',
'county', 'city', 'location_importance', 'address_type',
'congressional_district', 'state_house_district', 'state_senate_district',
'px_code']
participants_columns = ['participant_age1', 'participant1_child',
'participant1_teen', 'participant1_adult', 'participant1_male',
'participant1_female', 'min_age_participants', 'avg_age_participants', 'max_age_participants',
'n_participants_child', 'n_participants_teen', 'n_participants_adult',
'n_males', 'n_females', 'n_killed', 'n_injured', 'n_arrested',
'n_unharmed', 'n_participants']
characteristic_columns = ['notes', 'incident_characteristics1', 'incident_characteristics2',
'firearm', 'air_gun', 'shots', 'aggression', 'suicide', 'injuries',
'death', 'road', 'illegal_holding', 'house', 'school', 'children',
'drugs', 'officers', 'organized', 'social_reasons', 'defensive',
'workplace', 'abduction', 'unintentional']
external_columns = ['povertyPercentage', 'party', 'candidatevotes', 'totalvotes', 'candidateperc', 'population_state_2010']
incidents_df = incidents_df[time_columns + geo_columns + participants_columns + characteristic_columns + external_columns]
incidents_df = incidents_df.rename(
columns={
'povertyPercentage': 'poverty_perc',
'candidatevotes': 'candidate_votes',
'totalvotes': 'total_votes',
'candidateperc': 'candidate_perc'
}
)
incidents_df.to_csv(DATA_FOLDER_PATH +'incidents_cleaned.csv', index=True)